Posted to solr-user@lucene.apache.org by Bill Bell <bi...@gmail.com> on 2015/01/03 09:28:43 UTC

Re: How large is your solr index?

For Solr 5 why don't we switch it to 64 bit ??

Bill Bell
Sent from mobile


> On Dec 29, 2014, at 1:53 PM, Jack Krupansky <ja...@gmail.com> wrote:
> 
> And that Lucene index document limit includes deleted and updated
> documents, so even if your actual document count stays under 2^31-1,
> deleting and updating documents can push the apparent document count over
> the limit unless you very aggressively merge segments to expunge deleted
> documents.
> 
> -- Jack Krupansky
> 
> On Mon, Dec 29, 2014 at 12:54 PM, Erick Erickson <er...@gmail.com>
> wrote:
> 
>> When you say 2B docs on a single Solr instance, are you talking only one
>> shard?
>> Because if you are, you're very close to the absolute upper limit of a
>> shard, internally
>> the doc id is an int, so the cap is 2^31. Anything past that will cause all sorts of problems.
>> 
>> But yeah, your 100B documents are going to use up a lot of servers...
>> 
>> Best,
>> Erick
>> 
>> On Mon, Dec 29, 2014 at 7:24 AM, Bram Van Dam <br...@intix.eu>
>> wrote:
>>> Hi folks,
>>> 
>>> I'm trying to get a feel of how large Solr can grow without slowing down
>> too
>>> much. We're looking into a use-case with up to 100 billion documents
>>> (SolrCloud), and we're a little afraid that we'll end up requiring 100
>>> servers to pull it off.
>>> 
>>> The largest index we currently have is ~2 billion documents in a single
>> Solr
>>> instance. Documents are smallish (5k each) and we have ~50 fields in the
>>> schema, with an index size of about 2TB. Performance is mostly OK. Cold
>>> searchers take a while, but most queries are alright after warming up. I
>>> wish I could provide more statistics, but I only have very limited
>> access to
>>> the data (...banks...).
>>> 
>>> I'd be very grateful to anyone sharing statistics, especially on the larger
>> end
>>> of the spectrum -- with or without SolrCloud.
>>> 
>>> Thanks,
>>> 
>>> - Bram
>> 
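Jack's warning above is easy to quantify: the limit applies to Lucene's maxDoc (live plus not-yet-merged deleted documents), not to numDocs. The figures below are illustrative, not from any real index:

```python
LUCENE_MAX = 2**31 - 1        # hard per-shard document-id limit

live_docs = 1_900_000_000     # numDocs: the count you normally look at
deleted_docs = 300_000_000    # deleted/updated docs not yet merged away
max_doc = live_docs + deleted_docs  # maxDoc: what counts against the limit

print(live_docs < LUCENE_MAX)   # True: looks safely under the limit
print(max_doc > LUCENE_MAX)     # True: actually over it
```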

Re: How large is your solr index?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/3/2015 9:02 AM, Erick Erickson wrote:
> bq: For Solr 5 why don't we switch it to 64 bit ??
> 
> -1 on this for a couple of reasons
>> it'd be pretty invasive, and 5.0 may be imminent. Far too big a change to implement at the last second
>> It's not clear that it's even useful. Once you get to that many documents, performance usually suffers
> 
> Of course I wouldn't be doing the work so I really don't have much of
> a vote, but it's not clear to me at
> all that enough people would actually have a use-case for 2b+ docs in
> a single shard to make it
> worthwhile. At that scale GC potentially becomes really unpleasant for
> instance....

I agree, 2 billion documents in a single index is MORE than enough.  If
you actually create an index that large, you're going to have
performance problems, and most of those performance problems will likely
be related to garbage collection.  I can extrapolate one such problem
from personal experience on a much smaller index.

A filterCache entry for a 2 billion document index is 256MB in size.
Assuming you're using the G1 collector, the maximum size for a G1 heap
region is 32MB, which means that at that size, every single filter will
result in an object that is allocated immediately from the old
generation (it's called a humongous allocation).  Allocating that much
memory from the old generation will eventually (and frequently) result
in a full garbage collection ... and you do not want your application to
wait for a full garbage collection on the heap size that would be
required for a 2 billion document index.  It could easily exceed 30 or
60 seconds.

When you consider the current limitations of G1GC, it would be advisable
to keep each Solr index below 100 million documents.  At 134,217,728
documents, each filter object will be too large (more than 16MB) to be
considered a normal allocation on the max heap region size (32MB).
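The arithmetic behind those two numbers is easy to check (a minimal sketch; a filterCache entry is a bitset with one bit per document):

```python
MB = 1024 * 1024

def filter_entry_bytes(max_doc):
    # One bit per document in the index (maxDoc, including deleted docs).
    return max_doc // 8

# An index at the 2^31 document limit: a 256MB object per cached filter.
print(filter_entry_bytes(2**31) // MB)        # 256
# G1 treats any allocation over half a region as "humongous"; with the
# maximal 32MB region, that 16MB threshold is crossed at 134,217,728 docs.
print(filter_entry_bytes(134_217_728) // MB)  # 16
```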

Even with the older battle-tested CMS collector (assuming good tuning
options), I think the huge object sizes (and the huge number of smaller
objects) resulting from a 2 billion document index will have major
garbage collection problems.

Thanks,
Shawn


Re: How large is your solr index?

Posted by Erick Erickson <er...@gmail.com>.
bq: I'm wondering if anyone has been using SolrCloud with HDFS at large scales

Absolutely, there are several companies doing this, see Lucidworks and
Cloudera for two instances.

Solr itself has the MapReduceIndexerTool for indexing to Solr instances
running on HDFS, FWIW.

About needing 3x the memory: simply splitting shards is only _part_
of the solution, as you're seeing.
The next bit would be to move the sub-shards onto other metal, pointing
at the HDFS index to be sure.

FWIW,
Erick



On Wed, Jan 7, 2015 at 9:50 AM, Joseph Obernberger
<jo...@lovehorsepower.com> wrote:
> Kinda late to the party on this very interesting thread, but I'm wondering
> if anyone has been using SolrCloud with HDFS at large scales?  We really
> like this capability since our data is inside of Hadoop and we can run the
> Solr shards on the same nodes, and we only need to manage one pool of
> storage (HDFS).  Our current setup consists of 11 systems (bare metal;
> our cluster is actually 16 nodes, but not all run Solr) with a 2.9TByte
> index and just under 900 million docs spread across 22 shards (2 shards per
> physical box) in a single collection.  We index around 3 million docs per
> day.
>
> A typical query takes about 7 seconds to run, but we also do faceting and
> clustering.  Those can take in the 3 - 5 minute range depending on what was
> queried, but can be as little as 10 seconds. The index contains about 100
> fields.
> We are looking at switching to a different method of indexing our data which
> will involve a much larger number of fields, and very little stored in the
> index (index only) to help improve performance.
>
> I've used SPLITSHARD with success, but the server doing the split needs
> to have triple the usual direct memory when using HDFS, because that one
> node will be running three shards during the split.  This can lead
> to 'swap hell' if you're not careful. On large indexes, the split can take a
> very long time to run; much longer than the REST timeout, but it can be
> monitored by checking ZooKeeper's clusterstate.json.
>
> -Joe
>
>
>
> On 1/7/2015 4:25 AM, Bram Van Dam wrote:
>>
>> On 01/06/2015 07:54 PM, Erick Erickson wrote:
>>>
>>> Have you considered pre-supposing SolrCloud and using the SPLITSHARD
>>> API command?
>>
>>
>> I think that's the direction we'll probably be going. Index size (at least
>> for us) can be unpredictable in some cases. Some clients start out small and
>> then grow exponentially, while others start big and then don't grow much at
>> all. Starting with SolrCloud would at least give us that flexibility.
>>
>> That being said, SPLITSHARD doesn't seem ideal. If a shard reaches a
>> certain size, it would be better for us to simply add an extra shard,
>> without splitting.
>>
>>
>>> On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge <pe...@gmail.com>
>>> wrote:
>>>>
>>>> ++1 for the automagic shard creator. We've been looking into doing this
>>>> sort of thing internally - i.e. when a shard reaches a certain size/num
>>>> docs, it creates 'sub-shards' to which new commits are sent and queries
>>>> to
>>>> the 'parent' shard are included. The concept works, as long as you don't
>>>> try any non-dist stuff - it's one reason why all our fields are always
>>>> single valued.
>>
>>
>> Is there a problem with multi-valued fields and distributed queries?
>>
>>>> A cool side-effect of sub-sharding (for lack of a snappy term) is that
>>>> the
>>>> parent shard then stops suffering from auto-warming latency due to
>>>> commits
>>>> (we do a fair amount of committing). In theory, you could carry on
>>>> sub-sharding until your hardware starts gasping for air.
>>
>>
>> Sounds like you're doing something similar to us. In some cases we have a
>> hard commit every minute. Keeping the caches hot seems like a very good
>> reason to send data to a specific shard. At least I'm assuming that when you
>> add documents to a single shard and commit; the other shards won't be
>> impacted...
>>
>>  - Bram
>>
>>
>

Re: How large is your solr index?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/8/2015 9:39 AM, Joseph Obernberger wrote:
> Yes - it would be 20GBytes of cache per 270GBytes of data.

That's not a lot of cache.  One rule of thumb is that you should have at
least 50% of the index size available as cache, with 100% being a lot
better.  The caching should happen on the Solr server itself so there
isn't a network bottleneck.  This is one of several reasons why local
storage on regular filesystems is preferred for Solr.

> We've tried lower Xmx but we get OOM errors during faceting of large
> datasets.  Right now we're running two JVMs per physical box (2 shards
> per box), but we're going to be changing that to one JVM and one shard
> per box.

This wiki page has some info on what can cause high heap requirements
and some general ideas for what you can do about it:

http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap

If you want to discuss your specific situation, we can use the list,
direct email, or the #solr IRC channel.

http://wiki.apache.org/solr/IRCChannels

Thanks,
Shawn


Re: How large is your solr index?

Posted by Joseph Obernberger <jo...@lovehorsepower.com>.
On 1/8/2015 3:16 AM, Toke Eskildsen wrote:
> On Wed, 2015-01-07 at 22:26 +0100, Joseph Obernberger wrote:
>> Thank you Toke - yes - the data is indexed throughout the day.  We are
>> handling very few searches - probably 50 a day; this is an R&D system.
> If your searches are in small bundles, you could pause the indexing flow
> while the searches are executed, for better performance.
>
>> Our HDFS cache, I believe, is too small at 10GBytes per shard.
> That depends a lot on your corpus, your searches and underlying storage.
> But with our current level of information, it is a really good bet:
> Having 10GB cache per 130GB (270GB?) data is not a lot with spinning
> drives.
Yes - it would be 20GBytes of cache per 270GBytes of data.
>
>> Current parameters for running each shard are:
>> JAVA_OPTS="-XX:MaxDirectMemorySize=10g -XX:+UseLargePages -XX:NewRatio=3
> [...]
>> -Xmx10752m"
> One Solr/shard? You could probably win a bit by having one Solr/machine
> instead. Anyway, it's quite a high Xmx, but I presume you have measured
> the memory needs.
We've tried lower Xmx but we get OOM errors during faceting of large 
datasets.  Right now we're running two JVMs per physical box (2 shards 
per box), but we're going to be changing that to one JVM and one shard 
per box.
>
>> I'd love to try SSDs, but don't have the budget at present to go that
>> route.
> We find the price/performance of SSD + moderate RAM to be a considerably
> better deal than spinning drives + a lot of RAM, even when buying
> enterprise hardware. With consumer SSDs (used in our large server) it is
> even cheaper. It all depends on use pattern of course, but
> your setup with non-concurrent searches seems like it would fit well.
>
> Note: I am sure that RAM == index size would deliver very high
> performance. With enough RAM you can use tape to hold the index. Whether
> it is cost effective is another matter.
Ha!  Yes - our index is accessible via a 2400 baud modem, but we have 
lots of cache!  ;)
>
>> I'd really like to get the HDFS option to work well as it
>> reduces system complexity.
> That is very understandable. We examined the option of networked storage
> (Isilon) with underlying spindles, and it performed adequately for our
> needs up to 2-3TB of index data. Unfortunately the heavy random read
> load from Solr meant a noticeable degradation of other services using
> the networked storage. I am sure it could be solved with more
> centralized hardware, but in the end we found it cheaper and simpler to
> use local storage for search. This will of course differ across
> organizations and setups.

We're going to experiment with the one shard per box and more RAM cache 
per shard and see where that gets us; we'll also be adding more shards.
Thanks for the tips!
Interesting that you mention Isilon as we're planning on doing an eval 
with their product this year where we'll be testing out their HDFS 
layer.  It's a potential way to balance compute and storage since you 
can add HDFS storage without adding compute.

>
> - Toke Eskildsen
>
>
>
-Joe

Re: How large is your solr index?

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Wed, 2015-01-07 at 22:26 +0100, Joseph Obernberger wrote:
> Thank you Toke - yes - the data is indexed throughout the day.  We are 
> handling very few searches - probably 50 a day; this is an R&D system.

If your searches are in small bundles, you could pause the indexing flow
while the searches are executed, for better performance. 

> Our HDFS cache, I believe, is too small at 10GBytes per shard.

That depends a lot on your corpus, your searches and underlying storage.
But with our current level of information, it is a really good bet:
Having 10GB cache per 130GB (270GB?) data is not a lot with spinning
drives.

> Current parameters for running each shard are:
> JAVA_OPTS="-XX:MaxDirectMemorySize=10g -XX:+UseLargePages -XX:NewRatio=3 
[...]
> -Xmx10752m"

One Solr/shard? You could probably win a bit by having one Solr/machine
instead. Anyway, it's quite a high Xmx, but I presume you have measured
the memory needs.

> I'd love to try SSDs, but don't have the budget at present to go that 
> route.

We find the price/performance of SSD + moderate RAM to be a considerably
better deal than spinning drives + a lot of RAM, even when buying
enterprise hardware. With consumer SSDs (used in our large server) it is
even cheaper. It all depends on use pattern of course, but
your setup with non-concurrent searches seems like it would fit well.

Note: I am sure that RAM == index size would deliver very high
performance. With enough RAM you can use tape to hold the index. Whether
it is cost effective is another matter.

> I'd really like to get the HDFS option to work well as it 
> reduces system complexity.

That is very understandable. We examined the option of networked storage
(Isilon) with underlying spindles, and it performed adequately for our
needs up to 2-3TB of index data. Unfortunately the heavy random read
load from Solr meant a noticeable degradation of other services using
the networked storage. I am sure it could be solved with more
centralized hardware, but in the end we found it cheaper and simpler to
use local storage for search. This will of course differ across
organizations and setups.

- Toke Eskildsen



Re: How large is your solr index?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/7/2015 2:26 PM, Joseph Obernberger wrote:
> Thank you Toke - yes - the data is indexed throughout the day.  We are
> handling very few searches - probably 50 a day; this is an R&D system.
> Our HDFS cache, I believe, is too small at 10GBytes per shard.  This
> comes out to 20GBytes of HDFS cache per physical machine plus about
> 10G each for the 2 JVMs running the shards.  Each of those machines is
> also running other services which leaves very little RAM available for
> FS cache.
>
> Current parameters for running each shard are:
> JAVA_OPTS="-XX:MaxDirectMemorySize=10g -XX:+UseLargePages
> -XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90
> -XX:MaxTenuringThreshold=8 -XX:+UseConcMarkSweepGC
> -XX:+CMSScavengeBeforeRemark -XX:PretenureSizeThreshold=64m
> -XX:CMSFullGCsBeforeCompaction=1 -XX:+UseCMSInitiatingOccupancyOnly
> -XX:CMSInitiatingOccupancyFraction=70 -XX:CMSTriggerPermRatio=80
> -XX:CMSMaxAbortablePrecleanTime=6000 -XX:+CMSParallelRemarkEnabled
> -XX:+ParallelRefProcEnabled -XX:+AggressiveOpts
> -XX:ParallelGCThreads=7 -Xmx10752m"
>
> I'd love to try SSDs, but don't have the budget at present to go that
> route.  I'd really like to get the HDFS option to work well as it
> reduces system complexity.  It seems to me that if our HDFS cluster
> has lots/enough spindles, performance should be relatively good, as
> long as the OS can actually do some caching.  We will be adding more
> HDFS nodes in the future, increasing spindle count and reducing the
> amount of data stored into Solr.  When we redo our Solr Cloud, we will
> only run one shard per box, and supply more HDFS cache.

I can make very little comment about HDFS, because I've never used it. 
I can say that you want enough memory such that the data can be fully
cached in the memory on the Solr machine.  If you're in a situation
where caching happens on the HDFS servers but then has to cross the
network to get to Solr, then you'll have your network as a bottleneck
... a gigabit LAN is far slower than local RAM, and tends to be even
slower than modern high-capacity disks, too.

When it comes to GC options, I do have recent and relevant experience. 
Your GC options look a lot like the CMS options that I have been
advising for quite a while ... but recently I have been getting better
results with G1 and some specific tuning options on the latest Java
versions.

http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning
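As an illustrative starting point (the flag values here are my own assumptions, not a copy of that wiki page), G1 tuning along those lines might look like:

```shell
# G1 with an explicit region size; 32m is the largest region G1 supports,
# which raises the "humongous" allocation threshold to 16MB per object.
# Heap size and pause target are placeholders - tune for your own index.
JAVA_OPTS="-XX:+UseG1GC \
  -XX:G1HeapRegionSize=32m \
  -XX:MaxGCPauseMillis=250 \
  -XX:+ParallelRefProcEnabled \
  -Xms8g -Xmx8g"
```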

Thanks,
Shawn


Re: How large is your solr index?

Posted by Joseph Obernberger <jo...@lovehorsepower.com>.
Thank you Toke - yes - the data is indexed throughout the day.  We are 
handling very few searches - probably 50 a day; this is an R&D system.
Our HDFS cache, I believe, is too small at 10GBytes per shard.  This 
comes out to 20GBytes of HDFS cache per physical machine plus about 10G 
each for the 2 JVMs running the shards.  Each of those machines is also 
running other services which leaves very little RAM available for FS cache.

Current parameters for running each shard are:
JAVA_OPTS="-XX:MaxDirectMemorySize=10g -XX:+UseLargePages -XX:NewRatio=3 
-XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 
-XX:MaxTenuringThreshold=8 -XX:+UseConcMarkSweepGC 
-XX:+CMSScavengeBeforeRemark -XX:PretenureSizeThreshold=64m 
-XX:CMSFullGCsBeforeCompaction=1 -XX:+UseCMSInitiatingOccupancyOnly 
-XX:CMSInitiatingOccupancyFraction=70 -XX:CMSTriggerPermRatio=80 
-XX:CMSMaxAbortablePrecleanTime=6000 -XX:+CMSParallelRemarkEnabled 
-XX:+ParallelRefProcEnabled -XX:+AggressiveOpts -XX:ParallelGCThreads=7 
-Xmx10752m"

I'd love to try SSDs, but don't have the budget at present to go that 
route.  I'd really like to get the HDFS option to work well as it 
reduces system complexity.  It seems to me that if our HDFS cluster has 
lots/enough spindles, performance should be relatively good, as long as 
the OS can actually do some caching.  We will be adding more HDFS nodes 
in the future, increasing spindle count and reducing the amount of data 
stored into Solr.  When we redo our Solr Cloud, we will only run one 
shard per box, and supply more HDFS cache.

-Joe

On 1/7/2015 3:50 PM, Toke Eskildsen wrote:
> Joseph Obernberger [joeo@lovehorsepower.com] wrote:
>
> [HDFS, 900M docs, 2.9TB, 22 shards, 11 bare metal boxes]
>
>> A typical query takes about 7 seconds to run, but we also do faceting
>> and clustering.  Those can take in the 3 - 5 minute range depending on
>> what was queried, but can be as little as 10 seconds. The index contains
>> about 100 fields.
> 7 seconds without faceting seems like a long time. I am guessing your 3M daily updates are spread throughout the day, instead of being a nightly batch job? How many concurrent searches are you handling?
>
> We have no experience with HDFS for Solr indexes, but a quick check indicates that it is not a good fit for Solr. At least not out of the box: http://hbase.apache.org/book.html#perf.hdfs.curr
>
> We did at one point try to use networked storage for our index. That meant 1/3 performance, compared to local storage, but of course your mileage will vary. As you are looking into ways of improving performance, what about testing the performance difference with local storage (SSD of course)?
>
> - Toke Eskildsen
>


RE: How large is your solr index?

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Joseph Obernberger [joeo@lovehorsepower.com] wrote:

[HDFS, 900M docs, 2.9TB, 22 shards, 11 bare metal boxes]

> A typical query takes about 7 seconds to run, but we also do faceting
> and clustering.  Those can take in the 3 - 5 minute range depending on
> what was queried, but can be as little as 10 seconds. The index contains
> about 100 fields.

7 seconds without faceting seems like a long time. I am guessing your 3M daily updates are spread throughout the day, instead of being a nightly batch job? How many concurrent searches are you handling?

We have no experience with HDFS for Solr indexes, but a quick check indicates that it is not a good fit for Solr. At least not out of the box: http://hbase.apache.org/book.html#perf.hdfs.curr

We did at one point try to use networked storage for our index. That meant 1/3 performance, compared to local storage, but of course your mileage will vary. As you are looking into ways of improving performance, what about testing the performance difference with local storage (SSD of course)?

- Toke Eskildsen

Re: How large is your solr index?

Posted by Joseph Obernberger <jo...@lovehorsepower.com>.
Kinda late to the party on this very interesting thread, but I'm 
wondering if anyone has been using SolrCloud with HDFS at large scales?  
We really like this capability since our data is inside of Hadoop and we 
can run the Solr shards on the same nodes, and we only need to manage 
one pool of storage (HDFS).  Our current setup consists of 11 systems 
(bare metal; our cluster is actually 16 nodes, but not all run 
Solr) with a 2.9TByte index and just under 900 million docs spread 
across 22 shards (2 shards per physical box) in a single collection.  We 
index around 3 million docs per day.

A typical query takes about 7 seconds to run, but we also do faceting 
and clustering.  Those can take in the 3 - 5 minute range depending on 
what was queried, but can be as little as 10 seconds. The index contains 
about 100 fields.
We are looking at switching to a different method of indexing our data 
which will involve a much larger number of fields, and very little 
stored in the index (index only) to help improve performance.

I've used SPLITSHARD with success, but the server doing the split 
needs to have triple the usual direct memory when using HDFS, because 
that one node will be running three shards during the split.
This can lead to 'swap hell' if you're not careful. On large indexes, 
the split can take a very long time to run; much longer than the REST 
timeout, but it can be monitored by checking ZooKeeper's clusterstate.json.
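Since the Collections API accepts an async parameter, a long-running split can also be launched in the background and polled with REQUESTSTATUS instead of racing the REST timeout. A sketch of the two calls (host, collection, and request id below are made up):

```python
from urllib.parse import urlencode

base = "http://localhost:8983/solr/admin/collections"

# Fire off the split asynchronously; the HTTP call returns immediately.
split_url = base + "?" + urlencode({
    "action": "SPLITSHARD", "collection": "mycoll",
    "shard": "shard1", "async": "split-shard1"})

# Poll this until the request is reported completed.
status_url = base + "?" + urlencode({
    "action": "REQUESTSTATUS", "requestid": "split-shard1"})

print(split_url)
print(status_url)
```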

-Joe


On 1/7/2015 4:25 AM, Bram Van Dam wrote:
> On 01/06/2015 07:54 PM, Erick Erickson wrote:
>> Have you considered pre-supposing SolrCloud and using the SPLITSHARD
>> API command?
>
> I think that's the direction we'll probably be going. Index size (at 
> least for us) can be unpredictable in some cases. Some clients start 
> out small and then grow exponentially, while others start big and then 
> don't grow much at all. Starting with SolrCloud would at least give us 
> that flexibility.
>
> That being said, SPLITSHARD doesn't seem ideal. If a shard reaches a 
> certain size, it would be better for us to simply add an extra shard, 
> without splitting.
>
>
>> On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge 
>> <pe...@gmail.com> wrote:
>>> ++1 for the automagic shard creator. We've been looking into doing this
>>> sort of thing internally - i.e. when a shard reaches a certain size/num
>>> docs, it creates 'sub-shards' to which new commits are sent and 
>>> queries to
>>> the 'parent' shard are included. The concept works, as long as you 
>>> don't
>>> try any non-dist stuff - it's one reason why all our fields are always
>>> single valued.
>
> Is there a problem with multi-valued fields and distributed queries?
>
>>> A cool side-effect of sub-sharding (for lack of a snappy term) is 
>>> that the
>>> parent shard then stops suffering from auto-warming latency due to 
>>> commits
>>> (we do a fair amount of committing). In theory, you could carry on
>>> sub-sharding until your hardware starts gasping for air.
>
> Sounds like you're doing something similar to us. In some cases we 
> have a hard commit every minute. Keeping the caches hot seems like a 
> very good reason to send data to a specific shard. At least I'm 
> assuming that when you add documents to a single shard and commit; the 
> other shards won't be impacted...
>
>  - Bram
>
>


Re: How large is your solr index?

Posted by Erick Erickson <er...@gmail.com>.
You shouldn't _have_ to keep track of this yourself since Solr 4.4,
see SOLR-4965 and the associated Lucene JIRA. Those are supposed to
make issuing a commit on an index that hasn't changed a no-op.

If you do issue commits and new searchers do open when the index has
NOT changed, it's worth a JIRA.

FWIW,
Erick

On Wed, Jan 7, 2015 at 1:17 PM, Peter Sturge <pe...@gmail.com> wrote:
>> Is there a problem with multi-valued fields and distributed queries?
>
>> No. But there are some components that don't do the right thing in
>> distributed mode, joins for instance. The list is actually quite small and
>> is getting smaller all the time.
>
> Yes, joins is the main one. There used to be some dist constraints on
> grouping, but that might be from the 3.x days of field collapsing.
>
>> Sounds like you're doing something similar to us. In some cases we have a
>> hard commit every minute. Keeping the caches hot seems like a very good
>> reason to send data to a specific shard. At least I'm assuming that when
> you
>> add documents to a single shard and commit; the other shards won't be
>> impacted...
>
>> Not true if the other shards have had any indexing activity. The commit is
>> usually forwarded to all shards. If the individual index on a
>> particular shard is
>> unchanged then it should be a no-op though.
>
> This is an excellent point, and well worth taking some care on.
> We do it by indexing to a number of shards, and only commit to those that
> actually have something to commit - although an empty commit might be a
> no-op on the indexing side, it's not on the autowarming/faceting side -
> care needs to be taken so that you don't hose your caches unnecessarily.
>
>
> On Wed, Jan 7, 2015 at 4:42 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> See below:
>>
>>
>> On Wed, Jan 7, 2015 at 1:25 AM, Bram Van Dam <br...@intix.eu> wrote:
>> > On 01/06/2015 07:54 PM, Erick Erickson wrote:
>> >>
>> >> Have you considered pre-supposing SolrCloud and using the SPLITSHARD
>> >> API command?
>> >
>> >
>> > I think that's the direction we'll probably be going. Index size (at
>> least
>> > for us) can be unpredictable in some cases. Some clients start out small
>> and
>> > then grow exponentially, while others start big and then don't grow much
>> at
>> > all. Starting with SolrCloud would at least give us that flexibility.
>> >
>> > That being said, SPLITSHARD doesn't seem ideal. If a shard reaches a
>> certain
>> > size, it would be better for us to simply add an extra shard, without
>> > splitting.
>> >
>>
>> True, and you can do this if you take explicit control of the document
>> routing, but...
>> that's quite tricky. You forever after have to send any _updates_ to the
>> same
>> shard you did the first time, whereas SPLITSHARD will "do the right thing".
>>
>> >
>> >> On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge <pe...@gmail.com>
>> >> wrote:
>> >>>
>> >>> ++1 for the automagic shard creator. We've been looking into doing this
>> >>> sort of thing internally - i.e. when a shard reaches a certain size/num
>> >>> docs, it creates 'sub-shards' to which new commits are sent and queries
>> >>> to
>> >>> the 'parent' shard are included. The concept works, as long as you
>> don't
>> >>> try any non-dist stuff - it's one reason why all our fields are always
>> >>> single valued.
>> >
>> >
>> > Is there a problem with multi-valued fields and distributed queries?
>>
>> No. But there are some components that don't do the right thing in
>> distributed mode, joins for instance. The list is actually quite small and
>> is getting smaller all the time.
>>
>> >
>> >>> A cool side-effect of sub-sharding (for lack of a snappy term) is that
>> >>> the
>> >>> parent shard then stops suffering from auto-warming latency due to
>> >>> commits
>> >>> (we do a fair amount of committing). In theory, you could carry on
>> >>> sub-sharding until your hardware starts gasping for air.
>> >
>> >
>> > Sounds like you're doing something similar to us. In some cases we have a
>> > hard commit every minute. Keeping the caches hot seems like a very good
>> > reason to send data to a specific shard. At least I'm assuming that when
>> you
>> > add documents to a single shard and commit; the other shards won't be
>> > impacted...
>>
>> Not true if the other shards have had any indexing activity. The commit is
>> usually forwarded to all shards. If the individual index on a
>> particular shard is
>> unchanged then it should be a no-op though.
>>
>> But the usage pattern here is its own bit of a trap. If all your
>> indexing is going to a single shard, then the entire indexing _load_ is
>> also happening on that shard. So the CPU utilization will be higher on
>> that shard than the older ones.
>> Since distributed requests need to get a response from every shard before
>> returning to the client, the response time will be bounded by the response
>> from
>> the slowest shard and this may actually be slower. Probably only noticeable
>> when the CPU is maxed anyway though.
>>
>>
>>
>> >
>> >  - Bram
>> >
>>
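The bookkeeping burden of explicit routing that Erick describes can be seen with Solr's default compositeId router: a "prefix!id" document id hashes the prefix, so every document sharing a prefix lands on the same shard, and every later update must reuse exactly the same prefix. A minimal sketch (prefix and id names are made up):

```python
def routed_id(route_key, doc_id):
    # compositeId routing: everything before '!' determines the target shard.
    return f"{route_key}!{doc_id}"

original = routed_id("clientA", "doc42")  # indexed once with this prefix
update = routed_id("clientA", "doc42")    # updates MUST repeat the prefix

print(original)            # clientA!doc42
print(original == update)  # True - a different prefix would hit another shard
```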

Re: How large is your solr index?

Posted by Peter Sturge <pe...@gmail.com>.
> Is there a problem with multi-valued fields and distributed queries?

> No. But there are some components that don't do the right thing in
> distributed mode, joins for instance. The list is actually quite small and
> is getting smaller all the time.

Yes, joins is the main one. There used to be some dist constraints on
grouping, but that might be from the 3.x days of field collapsing.

> Sounds like you're doing something similar to us. In some cases we have a
> hard commit every minute. Keeping the caches hot seems like a very good
> reason to send data to a specific shard. At least I'm assuming that when
you
> add documents to a single shard and commit; the other shards won't be
> impacted...

> Not true if the other shards have had any indexing activity. The commit is
> usually forwarded to all shards. If the individual index on a
> particular shard is
> unchanged then it should be a no-op though.

This is an excellent point, and well worth taking some care on.
We do it by indexing to a number of shards, and only commit to those that
actually have something to commit - although an empty commit might be a
no-op on the indexing side, it's not on the autowarming/faceting side -
care needs to be taken so that you don't hose your caches unnecessarily.


On Wed, Jan 7, 2015 at 4:42 PM, Erick Erickson <er...@gmail.com>
wrote:

> See below:
>
>
> On Wed, Jan 7, 2015 at 1:25 AM, Bram Van Dam <br...@intix.eu> wrote:
> > On 01/06/2015 07:54 PM, Erick Erickson wrote:
> >>
> >> Have you considered pre-supposing SolrCloud and using the SPLITSHARD
> >> API command?
> >
> >
> > I think that's the direction we'll probably be going. Index size (at
> least
> > for us) can be unpredictable in some cases. Some clients start out small
> and
> > then grow exponentially, while others start big and then don't grow much
> at
> > all. Starting with SolrCloud would at least give us that flexibility.
> >
> > That being said, SPLITSHARD doesn't seem ideal. If a shard reaches a
> certain
> > size, it would be better for us to simply add an extra shard, without
> > splitting.
> >
>
> True, and you can do this if you take explicit control of the document
> routing, but...
> that's quite tricky. You forever after have to send any _updates_ to the
> same
> shard you did the first time, whereas SPLITSHARD will "do the right thing".
>
> >
> >> On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge <pe...@gmail.com>
> >> wrote:
> >>>
> >>> ++1 for the automagic shard creator. We've been looking into doing this
> >>> sort of thing internally - i.e. when a shard reaches a certain size/num
> >>> docs, it creates 'sub-shards' to which new commits are sent and queries
> >>> to
> >>> the 'parent' shard are included. The concept works, as long as you
> don't
> >>> try any non-dist stuff - it's one reason why all our fields are always
> >>> single valued.
> >
> >
> > Is there a problem with multi-valued fields and distributed queries?
>
> No. But there are some components that don't do the right thing in
> distributed mode, joins for instance. The list is actually quite small and
> is getting smaller all the time.
>
> >
> >>> A cool side-effect of sub-sharding (for lack of a snappy term) is that
> >>> the
> >>> parent shard then stops suffering from auto-warming latency due to
> >>> commits
> >>> (we do a fair amount of committing). In theory, you could carry on
> >>> sub-sharding until your hardware starts gasping for air.
> >
> >
> > Sounds like you're doing something similar to us. In some cases we have a
> > hard commit every minute. Keeping the caches hot seems like a very good
> > reason to send data to a specific shard. At least I'm assuming that when
> you
> > add documents to a single shard and commit; the other shards won't be
> > impacted...
>
> Not true if the other shards have had any indexing activity. The commit is
> usually forwarded to all shards. If the individual index on a
> particular shard is
> unchanged then it should be a no-op though.
>
> But the usage pattern here is its own bit of a trap. If all your
> indexing is going
> to a single shard, then also the entire indexing _load_ is happening on
> that
> shard. So the CPU utilization will be higher on that shard than the older
> ones.
> Since distributed requests need to get a response from every shard before
> returning to the client, the response time will be bounded by the response
> from
> the slowest shard and this may actually be slower. Probably only noticeable
> when the CPU is maxed anyway though.
>
>
>
> >
> >  - Bram
> >
>

Re: How large is your solr index?

Posted by Erick Erickson <er...@gmail.com>.
bq: I still feel like shard management could be made easier. I'll see
if I can have a look at JIRA and try to pitch in.

_nobody_ will disagree here! It's been a time and interest thing so far...

Automating it all is tempting, but.... my experience so far indicates
that once people get to large scales they want to be able to take some
explicit control of where replicas land and what nodes collections are
hosted on, often having heterogenous collections hosted on the same
hardware, making it very difficult to automate entirely.

How's your Angular JS? See SOLR-5507, there's work being done to
rewrite the Admin UI in Angular JS. Since I totally avoid UI work I
don't have much insight into the pros/cons, so I trust the people
putting fingers to keyboards. I'm sure Upayavira would welcome any
help there.

What would be waaay cool and a step in the path to full automation
would be the ability to access the collections API through the UI. The
whole collections API is really not known by the current admin UI,
which has its roots in the single-core, no-shard world....

I should add that this is a place to _start_ IMO, "low hanging fruit"
and all that. But once the collections API is accessible, I can
imagine all sorts of stuff layered on top. But just providing a
cloud-friendly way to use the collections API calls would be a great
place to start.

I like to define a short-term goal, just to get things started, so
here's one (feel free to ignore of course!).

I want a way to see all the nodes in my system that have running
Solrs, whether or not they host any existing Solr replicas. I want to
click on any number of them and create a new collection hosted on
those nodes. Really, this is just "click on a bunch of nodes that are
then the nodeSet for the Collections CREATE action", along with a
dialog box for all the CREATE options.

FWIW,
Erick



On Sun, Jan 11, 2015 at 8:23 AM, Bram Van Dam <br...@intix.eu> wrote:
>> Do note that one strategy is to create more shards than you need at
>> the beginning. Say you determine that 10 shards will work fine, but
>> you expect to grow your corpus by 2x. _Start_  with 20 shards
>> (multiple shards can be hosted in the same JVM, no problem, see
>> maxShardsPerNode in the collections API CREATE action). Then
>> as your corpus grows you can move the shards to their own
>> boxes.
>
>
> I guess planning ahead is something we can do. We usually have a pretty good
> idea of how large our indexes are going to be (number of documents is one of
> the things we base our license pricing on). I still feel like shard
> management could be made easier. I'll see if I can have a look at JIRA and
> try to pitch in.
>
> Thanks a lot for the input, especially Shawn & Erick!

Re: How large is your solr index?

Posted by Bram Van Dam <br...@intix.eu>.
> Do note that one strategy is to create more shards than you need at
> the beginning. Say you determine that 10 shards will work fine, but
> you expect to grow your corpus by 2x. _Start_  with 20 shards
> (multiple shards can be hosted in the same JVM, no problem, see
> maxShardsPerNode in the collections API CREATE action). Then
> as your corpus grows you can move the shards to their own
> boxes.

I guess planning ahead is something we can do. We usually have a pretty 
good idea of how large our indexes are going to be (number of documents 
is one of the things we base our license pricing on). I still feel like 
shard management could be made easier. I'll see if I can have a look at 
JIRA and try to pitch in.

Thanks a lot for the input, especially Shawn & Erick!

Re: How large is your solr index?

Posted by Erick Erickson <er...@gmail.com>.
bq: you'll end up with N-1 nearly full boxes and 2 half-full boxes.

True, you'd have to repeat the process N times. At that point, though,
as Shawn mentions it's often easier to just re-index the whole thing.

Do note that one strategy is to create more shards than you need at
the beginning. Say you determine that 10 shards will work fine, but
you expect to grow your corpus by 2x. _Start_  with 20 shards
(multiple shards can be hosted in the same JVM, no problem, see
maxShardsPerNode in the collections API CREATE action). Then
as your corpus grows you can move the shards to their own
boxes.
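For concreteness, the CREATE call for that over-sharding strategy looks something like this (hostname, collection name and counts are made-up examples; the parameters are the standard Collections API ones):

```python
from urllib.parse import urlencode

# Hypothetical SolrCloud node; any node can receive Collections API calls.
SOLR = "http://solr1:8983/solr"

def create_collection_url(name, num_shards, max_shards_per_node, replication_factor=1):
    """Build a Collections API CREATE call that over-shards up front,
    packing several shards per JVM so they can be migrated later."""
    params = urlencode({
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "maxShardsPerNode": max_shards_per_node,
        "replicationFactor": replication_factor,
    })
    return f"{SOLR}/admin/collections?{params}"

# 20 shards on (say) 10 nodes: 2 shards per node to start with.
url = create_collection_url("bigindex", num_shards=20, max_shards_per_node=2)
print(url)
```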

This just kicks the can down the road of course, if your corpus grows
by 5x instead of 2x you're back to this discussion....

Best,
Erick

On Thu, Jan 8, 2015 at 7:08 AM, Shawn Heisey <ap...@elyograg.org> wrote:
> On 1/8/2015 4:37 AM, Bram Van Dam wrote:
>> Hmm. That is a good point. I wonder if there's some kind of middle
>> ground here? Something that lets me send an update (or new document) to
>> an arbitrary node/shard but which is still routed according to my
>> specific requirements? Maybe this can already be achieved by messing
>> with the routing?
>
> <snip>
>
>> That's fine. We have a lot of query (pre-)processing outside of Solr.
>> It's no problem for us to send a couple of queries to a couple of shards
>> and aggregate the result ourselves. It would, of course, be nice if
>> everything worked in distributed mode, but at least for us it's not an
>> issue. This is a side effect of our complex reporting requirements -- we
>> do aggregation, filtering and other magic on data that is partially in
>> Solr and partially elsewhere.
>
> SolrCloud, when you do fully automatic document routing, does handle
> everything for you.  You can query any node and send updates to any
> node, and they will end up in the right place.  There is currently a
> strong caveat: Indexing performance sucks when updates are initially
> sent to the wrong node.  The performance hit is far larger than we
> expected it to be, so there is an issue in Jira to try and make that
> better.  No visible work has been done on the issue yet:
>
> https://issues.apache.org/jira/browse/SOLR-6717
>
> The Java client (SolrJ, specifically CloudSolrServer) sends all updates
> to the correct nodes, because it can access the clusterstate and knows
> where updates need to go and where the shard leaders are.
>
>> This is a very good point. But I don't think SPLITSHARD is the magical
>> answer here. If you have N shards on N boxes, and they are all getting
>> nearly "full" and you decide to split one and move half to a new box,
> you'll end up with N-1 nearly full boxes and 2 half-full boxes. What
> happens if the disks fill up further? Do I have to split each shard?
> That sounds pretty nightmarish!
>
> Planning ahead for growth is critical with SolrCloud, but there is
> something you can do if you discover that you need to radically
> re-shard:  Create a whole new collection with the number of shards you
> want, likely using the original set of Solr servers plus some new ones.
>  Rebuild the index into that collection.  Delete the old collection, and
> create a collection alias pointing the original name at the new
> collection.  The alias will work for both queries and updates.
>
> Thanks,
> Shawn
>

Re: How large is your solr index?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/8/2015 4:37 AM, Bram Van Dam wrote:
> Hmm. That is a good point. I wonder if there's some kind of middle
> ground here? Something that lets me send an update (or new document) to
> an arbitrary node/shard but which is still routed according to my
> specific requirements? Maybe this can already be achieved by messing
> with the routing?

<snip>

> That's fine. We have a lot of query (pre-)processing outside of Solr.
> It's no problem for us to send a couple of queries to a couple of shards
> and aggregate the result ourselves. It would, of course, be nice if
> everything worked in distributed mode, but at least for us it's not an
> issue. This is a side effect of our complex reporting requirements -- we
> do aggregation, filtering and other magic on data that is partially in
> Solr and partially elsewhere.

SolrCloud, when you do fully automatic document routing, does handle
everything for you.  You can query any node and send updates to any
node, and they will end up in the right place.  There is currently a
strong caveat: Indexing performance sucks when updates are initially
sent to the wrong node.  The performance hit is far larger than we
expected it to be, so there is an issue in Jira to try and make that
better.  No visible work has been done on the issue yet:

https://issues.apache.org/jira/browse/SOLR-6717

The Java client (SolrJ, specifically CloudSolrServer) sends all updates
to the correct nodes, because it can access the clusterstate and knows
where updates need to go and where the shard leaders are.

> This is a very good point. But I don't think SPLITSHARD is the magical
> answer here. If you have N shards on N boxes, and they are all getting
> nearly "full" and you decide to split one and move half to a new box,
> you'll end up with N-1 nearly full boxes and 2 half-full boxes. What
> happens if the disks fill up further? Do I have to split each shard?
> That sounds pretty nightmarish!

Planning ahead for growth is critical with SolrCloud, but there is
something you can do if you discover that you need to radically
re-shard:  Create a whole new collection with the number of shards you
want, likely using the original set of Solr servers plus some new ones.
 Rebuild the index into that collection.  Delete the old collection, and
create a collection alias pointing the original name at the new
collection.  The alias will work for both queries and updates.
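That sequence maps onto three Collections API calls plus a re-index in the middle. A sketch (the collection and alias names are invented for illustration):

```python
from urllib.parse import urlencode

SOLR = "http://solr1:8983/solr"  # any node in the cluster (hypothetical host)

def collections_api(action, **params):
    """Build a Collections API URL for the given action."""
    return f"{SOLR}/admin/collections?{urlencode({'action': action, **params})}"

# 1. Create the replacement collection with the new shard count.
create = collections_api("CREATE", name="docs_v2", numShards=16, replicationFactor=2)
# 2. (Re-index everything into docs_v2 here.)
# 3. Drop the old collection and point the original name at the new one;
#    clients keep using the alias "docs" for both queries and updates.
delete = collections_api("DELETE", name="docs_v1")
alias = collections_api("CREATEALIAS", name="docs", collections="docs_v2")
print(alias)
```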

Thanks,
Shawn


Re: How large is your solr index?

Posted by Bram Van Dam <br...@intix.eu>.
On 01/07/2015 05:42 PM, Erick Erickson wrote:
> True, and you can do this if you take explicit control of the document
> routing, but...
> that's quite tricky. You forever after have to send any _updates_ to the same
> shard you did the first time, whereas SPLITSHARD will "do the right thing".

Hmm. That is a good point. I wonder if there's some kind of middle 
ground here? Something that lets me send an update (or new document) to 
an arbitrary node/shard but which is still routed according to my 
specific requirements? Maybe this can already be achieved by messing 
with the routing?
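One existing middle ground is the compositeId router: prefix each document id with a route key, and Solr hashes the prefix to pick the shard, so updates can be sent to any node while documents sharing a key stay co-located. A tiny sketch (the client names are invented):

```python
def composite_id(route_key, doc_id):
    """With the default compositeId router, everything before '!' is hashed
    to choose the shard, so docs sharing a route key land together."""
    return f"{route_key}!{doc_id}"

docs = [{"id": composite_id("clientA", n), "body": "..."} for n in range(3)]
print([d["id"] for d in docs])  # ['clientA!0', 'clientA!1', 'clientA!2']
```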

> <snip> there are some components that don't do the right thing in
> distributed mode, joins for instance. The list is actually quite small and
> is getting smaller all the time.


That's fine. We have a lot of query (pre-)processing outside of Solr. 
It's no problem for us to send a couple of queries to a couple of shards 
and aggregate the result ourselves. It would, of course, be nice if 
everything worked in distributed mode, but at least for us it's not an 
issue. This is a side effect of our complex reporting requirements -- we 
do aggregation, filtering and other magic on data that is partially in 
Solr and partially elsewhere.

> Not true if the other shards have had any indexing activity. The commit is
> usually forwarded to all shards. If the individual index on a
> particular shard is
> unchanged then it should be a no-op though.

I think a no-op commit no longer clears the caches either, so that's great.

> But the usage pattern here is its own bit of a trap. If all your
> indexing is going
> to a single shard, then also the entire indexing _load_ is happening on that
> shard. So the CPU utilization will be higher on that shard than the older ones.
> Since distributed requests need to get a response from every shard before
> returning to the client, the response time will be bounded by the response from
> the slowest shard and this may actually be slower. Probably only noticeable
> when the CPU is maxed anyway though.

This is a very good point. But I don't think SPLITSHARD is the magical 
answer here. If you have N shards on N boxes, and they are all getting 
nearly "full" and you decide to split one and move half to a new box, 
you'll end up with N-1 nearly full boxes and 2 half-full boxes. What
happens if the disks fill up further? Do I have to split each shard?
That sounds pretty nightmarish!

  - Bram

Re: How large is your solr index?

Posted by Erick Erickson <er...@gmail.com>.
See below:


On Wed, Jan 7, 2015 at 1:25 AM, Bram Van Dam <br...@intix.eu> wrote:
> On 01/06/2015 07:54 PM, Erick Erickson wrote:
>>
>> Have you considered pre-supposing SolrCloud and using the SPLITSHARD
>> API command?
>
>
> I think that's the direction we'll probably be going. Index size (at least
> for us) can be unpredictable in some cases. Some clients start out small and
> then grow exponentially, while others start big and then don't grow much at
> all. Starting with SolrCloud would at least give us that flexibility.
>
> That being said, SPLITSHARD doesn't seem ideal. If a shard reaches a certain
> size, it would be better for us to simply add an extra shard, without
> splitting.
>

True, and you can do this if you take explicit control of the document
routing, but...
that's quite tricky. You forever after have to send any _updates_ to the same
shard you did the first time, whereas SPLITSHARD will "do the right thing".
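For reference, the SPLITSHARD call itself is simple; it's the follow-up bookkeeping (moving sub-shards off the box) that takes care. A sketch with made-up host and names:

```python
from urllib.parse import urlencode

SOLR = "http://solr1:8983/solr"  # hypothetical node

def splitshard_url(collection, shard):
    """SPLITSHARD divides one shard's hash range in two; afterwards the
    sub-shards can be relocated to other machines."""
    params = urlencode({"action": "SPLITSHARD",
                        "collection": collection,
                        "shard": shard})
    return f"{SOLR}/admin/collections?{params}"

print(splitshard_url("bigindex", "shard1"))
```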

>
>> On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge <pe...@gmail.com>
>> wrote:
>>>
>>> ++1 for the automagic shard creator. We've been looking into doing this
>>> sort of thing internally - i.e. when a shard reaches a certain size/num
>>> docs, it creates 'sub-shards' to which new commits are sent and queries
>>> to
>>> the 'parent' shard are included. The concept works, as long as you don't
>>> try any non-dist stuff - it's one reason why all our fields are always
>>> single valued.
>
>
> Is there a problem with multi-valued fields and distributed queries?

No. But there are some components that don't do the right thing in
distributed mode, joins for instance. The list is actually quite small and
is getting smaller all the time.

>
>>> A cool side-effect of sub-sharding (for lack of a snappy term) is that
>>> the
>>> parent shard then stops suffering from auto-warming latency due to
>>> commits
>>> (we do a fair amount of committing). In theory, you could carry on
>>> sub-sharding until your hardware starts gasping for air.
>
>
> Sounds like you're doing something similar to us. In some cases we have a
> hard commit every minute. Keeping the caches hot seems like a very good
> reason to send data to a specific shard. At least I'm assuming that when you
> add documents to a single shard and commit; the other shards won't be
> impacted...

Not true if the other shards have had any indexing activity. The commit is
usually forwarded to all shards. If the individual index on a
particular shard is
unchanged then it should be a no-op though.

But the usage pattern here is its own bit of a trap. If all your
indexing is going
to a single shard, then also the entire indexing _load_ is happening on that
shard. So the CPU utilization will be higher on that shard than the older ones.
Since distributed requests need to get a response from every shard before
returning to the client, the response time will be bounded by the response from
the slowest shard and this may actually be slower. Probably only noticeable
when the CPU is maxed anyway though.



>
>  - Bram
>

Re: How large is your solr index?

Posted by Bram Van Dam <br...@intix.eu>.
On 01/06/2015 07:54 PM, Erick Erickson wrote:
> Have you considered pre-supposing SolrCloud and using the SPLITSHARD
> API command?

I think that's the direction we'll probably be going. Index size (at 
least for us) can be unpredictable in some cases. Some clients start out 
small and then grow exponentially, while others start big and then don't 
grow much at all. Starting with SolrCloud would at least give us that 
flexibility.

That being said, SPLITSHARD doesn't seem ideal. If a shard reaches a 
certain size, it would be better for us to simply add an extra shard, 
without splitting.


> On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge <pe...@gmail.com> wrote:
>> ++1 for the automagic shard creator. We've been looking into doing this
>> sort of thing internally - i.e. when a shard reaches a certain size/num
>> docs, it creates 'sub-shards' to which new commits are sent and queries to
>> the 'parent' shard are included. The concept works, as long as you don't
>> try any non-dist stuff - it's one reason why all our fields are always
>> single valued.

Is there a problem with multi-valued fields and distributed queries?

>> A cool side-effect of sub-sharding (for lack of a snappy term) is that the
>> parent shard then stops suffering from auto-warming latency due to commits
>> (we do a fair amount of committing). In theory, you could carry on
>> sub-sharding until your hardware starts gasping for air.

Sounds like you're doing something similar to us. In some cases we have 
a hard commit every minute. Keeping the caches hot seems like a very 
good reason to send data to a specific shard. At least I'm assuming that 
when you add documents to a single shard and commit; the other shards 
won't be impacted...

  - Bram


Re: How large is your solr index?

Posted by Erick Erickson <er...@gmail.com>.
Have you considered pre-supposing SolrCloud and using the SPLITSHARD
API command?
Even after that's done, the sub-shard needs to be physically moved to
another machine
(probably), but that too could be scripted.

May not be desirable, but I thought I'd mention it.

Best,
Erick

On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge <pe...@gmail.com> wrote:
> Yes, totally agree. We run 500m+ docs in a (non-cloud) Solr4, and it even
> performs reasonably well on commodity hardware with lots of faceting and
> concurrent indexing! Ok, you need a lot of RAM to keep faceting happy, but
> it works.
>
> ++1 for the automagic shard creator. We've been looking into doing this
> sort of thing internally - i.e. when a shard reaches a certain size/num
> docs, it creates 'sub-shards' to which new commits are sent and queries to
> the 'parent' shard are included. The concept works, as long as you don't
> try any non-dist stuff - it's one reason why all our fields are always
> single valued. There are also other implications like cleanup, deletes and
> security to take into account, to name a few.
> A cool side-effect of sub-sharding (for lack of a snappy term) is that the
> parent shard then stops suffering from auto-warming latency due to commits
> (we do a fair amount of committing). In theory, you could carry on
> sub-sharding until your hardware starts gasping for air.
>
>
> On Sun, Jan 4, 2015 at 1:44 PM, Bram Van Dam <br...@intix.eu> wrote:
>
>> On 01/04/2015 02:22 AM, Jack Krupansky wrote:
>>
>>> The reality doesn't seem to
>>> be there today. 50 to 100 million documents, yes, but beyond that takes
>>> some kind of "heroic" effort, whether a much beefier box, very careful and
>>> limited data modeling or limiting of query capabilities or tolerance of
>>> higher latency, expert tuning, etc.
>>>
>>
>> I disagree. On the scale, at least. Up until 500M Solr performs "well"
>> (read: well enough considering the scale) in a single shard on a single box
>> of commodity hardware. Without any tuning or heroic efforts. Sure, some
>> queries aren't as snappy as you'd like, and sure, indexing and querying at
>> the same time will be somewhat unpleasant, but it will work, and it will
>> work well enough.
>>
>> Will it work for thousands of concurrent users? Of course not. Anyone who
>> is after that sort of thing won't find themselves in this scenario -- they
>> will throw hardware at the problem.
>>
>> There is something to be said for making sharding less painful. It would
>> be nice if, for instance, Solr would automagically create a new shard once
>> some magic number was reached (2B at the latest, I guess). But then that'll
>> break some query features ... :-(
>>
>> The reason we're using single large instances (sometimes on beefy
>> hardware) is that SolrCloud is a pain. Not just from an administrative
>> point of view (though that seems to be getting better, kudos for that!),
>> but mostly because some queries cannot be executed with distributed=true.
>> Our users, at least, prefer a slow query over an impossible query.
>>
>> Actually, this 2B limit is a good thing. It'll help me convince
>> $management to donate some of our time to Solr :-)
>>
>>  - Bram
>>

Re: How large is your solr index?

Posted by Peter Sturge <pe...@gmail.com>.
Yes, totally agree. We run 500m+ docs in a (non-cloud) Solr4, and it even
performs reasonably well on commodity hardware with lots of faceting and
concurrent indexing! Ok, you need a lot of RAM to keep faceting happy, but
it works.

++1 for the automagic shard creator. We've been looking into doing this
sort of thing internally - i.e. when a shard reaches a certain size/num
docs, it creates 'sub-shards' to which new commits are sent and queries to
the 'parent' shard are included. The concept works, as long as you don't
try any non-dist stuff - it's one reason why all our fields are always
single valued. There are also other implications like cleanup, deletes and
security to take into account, to name a few.
A cool side-effect of sub-sharding (for lack of a snappy term) is that the
parent shard then stops suffering from auto-warming latency due to commits
(we do a fair amount of committing). In theory, you could carry on
sub-sharding until your hardware starts gasping for air.


On Sun, Jan 4, 2015 at 1:44 PM, Bram Van Dam <br...@intix.eu> wrote:

> On 01/04/2015 02:22 AM, Jack Krupansky wrote:
>
>> The reality doesn't seem to
>> be there today. 50 to 100 million documents, yes, but beyond that takes
>> some kind of "heroic" effort, whether a much beefier box, very careful and
>> limited data modeling or limiting of query capabilities or tolerance of
>> higher latency, expert tuning, etc.
>>
>
> I disagree. On the scale, at least. Up until 500M Solr performs "well"
> (read: well enough considering the scale) in a single shard on a single box
> of commodity hardware. Without any tuning or heroic efforts. Sure, some
> queries aren't as snappy as you'd like, and sure, indexing and querying at
> the same time will be somewhat unpleasant, but it will work, and it will
> work well enough.
>
> Will it work for thousands of concurrent users? Of course not. Anyone who
> is after that sort of thing won't find themselves in this scenario -- they
> will throw hardware at the problem.
>
> There is something to be said for making sharding less painful. It would
> be nice if, for instance, Solr would automagically create a new shard once
> some magic number was reached (2B at the latest, I guess). But then that'll
> break some query features ... :-(
>
> The reason we're using single large instances (sometimes on beefy
> hardware) is that SolrCloud is a pain. Not just from an administrative
> point of view (though that seems to be getting better, kudos for that!),
> but mostly because some queries cannot be executed with distributed=true.
> Our users, at least, prefer a slow query over an impossible query.
>
> Actually, this 2B limit is a good thing. It'll help me convince
> $management to donate some of our time to Solr :-)
>
>  - Bram
>

Re: How large is your solr index?

Posted by Bram Van Dam <br...@intix.eu>.
On 01/04/2015 02:22 AM, Jack Krupansky wrote:
> The reality doesn't seem to
> be there today. 50 to 100 million documents, yes, but beyond that takes
> some kind of "heroic" effort, whether a much beefier box, very careful and
> limited data modeling or limiting of query capabilities or tolerance of
> higher latency, expert tuning, etc.

I disagree. On the scale, at least. Up until 500M Solr performs "well" 
(read: well enough considering the scale) in a single shard on a single 
box of commodity hardware. Without any tuning or heroic efforts. Sure, 
some queries aren't as snappy as you'd like, and sure, indexing and 
querying at the same time will be somewhat unpleasant, but it will work, 
and it will work well enough.

Will it work for thousands of concurrent users? Of course not. Anyone 
who is after that sort of thing won't find themselves in this scenario 
-- they will throw hardware at the problem.

There is something to be said for making sharding less painful. It would 
be nice if, for instance, Solr would automagically create a new shard 
once some magic number was reached (2B at the latest, I guess). But then 
that'll break some query features ... :-(

The reason we're using single large instances (sometimes on beefy 
hardware) is that SolrCloud is a pain. Not just from an administrative 
point of view (though that seems to be getting better, kudos for that!), 
but mostly because some queries cannot be executed with 
distributed=true. Our users, at least, prefer a slow query over an 
impossible query.

Actually, this 2B limit is a good thing. It'll help me convince 
$management to donate some of our time to Solr :-)

  - Bram

Re: How large is your solr index?

Posted by Jack Krupansky <ja...@gmail.com>.
That's a laudable goal - to support low-latency queries - including
faceting - for "hundreds of millions" of documents, using Solr "out of the
box" on a random, commodity box selected by IT and just adding a dozen or
two fields to the default schema that are both indexed and stored, without
any "expert" tuning, by an "average" developer. The reality doesn't seem to
be there today. 50 to 100 million documents, yes, but beyond that takes
some kind of "heroic" effort, whether a much beefier box, very careful and
limited data modeling or limiting of query capabilities or tolerance of
higher latency, expert tuning, etc.

The proof is always in the pudding - pick a box, install Solr, setup the
schema, load 20 or 50 or 100 or 250 or 350 million documents, try some
queries with the features you need, and you get what you get.

But I agree that it would be highly desirable to push that 100 million
number up to 350 million or even 500 million ASAP since the pain of
unnecessary sharding is excessive.

I wonder what changes will have to occur in Lucene, or... what evolution in
commodity hardware will be necessary to get there.

-- Jack Krupansky

On Sat, Jan 3, 2015 at 6:11 PM, Toke Eskildsen <te...@statsbiblioteket.dk>
wrote:

> Erick Erickson [erickerickson@gmail.com] wrote:
> > I can't disagree. You bring up some of the points that make me
> _extremely_
> > reluctant to try to get this into 5.x though. 6.0 at the earliest I
> should
> > think.
>
> Ignoring the magic 2b number for a moment, I think the overall question is
> whether or not single shards should perform well in the hundreds of
> millions of documents range. The alternative is more shards, but it is
> quite an explicit process to handle shard-juggling. From an end-user
> perspective, the underlying technology matters little: Whatever the choice,
> it should be possible to install "something" on a machine and expect it to
> scale within the hardware limitations without much ado.
>
> - Toke Eskildsen
>

RE: How large is your solr index?

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Erick Erickson [erickerickson@gmail.com] wrote:
> I can't disagree. You bring up some of the points that make me _extremely_
> reluctant to try to get this into 5.x though. 6.0 at the earliest I should
> think.

Ignoring the magic 2b number for a moment, I think the overall question is whether or not single shards should perform well in the hundreds of millions of documents range. The alternative is more shards, but it is quite an explicit process to handle shard-juggling. From an end-user perspective, the underlying technology matters little: Whatever the choice, it should be possible to install "something" on a machine and expect it to scale within the hardware limitations without much ado.

- Toke Eskildsen

Re: How large is your solr index?

Posted by Erick Erickson <er...@gmail.com>.
I can't disagree. You bring up some of the points that make me _extremely_
reluctant to try to get this into 5.x though. 6.0 at the earliest I should
think.

And who knows? Java may get a GC process that's geared to modern
amounts of memory and get by the current pain....

Best,
Erick

On Sat, Jan 3, 2015 at 1:00 PM, Toke Eskildsen <te...@statsbiblioteket.dk>
wrote:

> Erick Erickson [erickerickson@gmail.com] wrote:
> > Of course I wouldn't be doing the work so I really don't have much of
> > a vote, but it's not clear to me at all that enough people would actually
> > have a use-case for 2b+ docs in a single shard to make it
> > worthwhile. At that scale GC potentially becomes really unpleasant for
> > instance....
>
> Over the last years we have seen a few use cases here on the mailing list.
> I would be very surprised if the number of such cases does not keep rising.
> Currently the work for a complete overhaul does not measure up to the
> rewards, but that is slowly changing. At the very least I find it prudent
> to not limit new Lucene/Solr interfaces to ints.
>
> As for GC: Right now a lot of structures are single-array oriented (for
> example using a long-array to represent bits in a bitset), which might not
> work well with current garbage collectors. A change to higher limits also
> means re-thinking such approaches: If the garbage collector likes objects
> below a certain size, then split the arrays accordingly. Likewise, iterations
> over structures linear in size to the index could be threaded. These are
> issues even with the current 2b limitation.
>
> - Toke Eskildsen
>

Re: How large is your solr index?

Posted by Jack Krupansky <ja...@gmail.com>.
Back in June on a similar thread I asked "Anybody care to forecast when
hardware will catch up with Solr and we can routinely look forward to
newbies complaining that they indexed "some" data and after only 10 minutes
they hit this weird 2G document count limit?"

Still not there.

So the race is on between when Lucene will relax the 2G limit and when
hardware gets fast enough that 2G documents can be indexed within a small
number of hours.
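
Jack's "small number of hours" can be put in numbers with a quick back-of-the-envelope sketch (the 4-hour target below is an assumption for illustration, not a figure from the thread):

```java
/**
 * Back-of-the-envelope: the sustained indexing rate needed to hit the
 * 2^31-1 docID ceiling within a given number of hours. The 4-hour
 * target in main() is an arbitrary assumption for illustration.
 */
public final class IndexRate {
    public static long docsPerSecond(long totalDocs, double hours) {
        return Math.round(totalDocs / (hours * 3600.0));
    }

    public static void main(String[] args) {
        // Filling a shard to its limit in 4 hours takes roughly 150K docs/sec
        System.out.println(docsPerSecond(Integer.MAX_VALUE, 4.0));
    }
}
```

So until commodity hardware routinely sustains six-figure docs/sec ingest, the limit stays mostly theoretical for a single shard.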


-- Jack Krupansky

On Sat, Jan 3, 2015 at 4:00 PM, Toke Eskildsen <te...@statsbiblioteket.dk>
wrote:

> Erick Erickson [erickerickson@gmail.com] wrote:
> > Of course I wouldn't be doing the work so I really don't have much of
> > a vote, but it's not clear to me at all that enough people would actually
> > have a use-case for 2b+ docs in a single shard to make it
> > worthwhile. At that scale GC potentially becomes really unpleasant for
> > instance....
>
> Over the last few years we have seen a few use cases here on the mailing
> list. I would be very surprised if the number of such cases does not keep
> rising. Currently the work for a complete overhaul does not measure up to
> the rewards, but that is slowly changing. At the very least I find it
> prudent not to limit new Lucene/Solr interfaces to ints.
>
> As for GC: Right now a lot of structures are single-array oriented (for
> example using a long array to represent the bits in a bitset), which might
> not work well with current garbage collectors. A change to higher limits
> also means re-thinking such approaches: if the garbage collector likes
> objects below a certain size, then split the arrays into pieces of that
> size. Likewise, iterations over structures linear in the size of the index
> could be threaded. These are issues even with the current 2b limitation.
>
> - Toke Eskildsen
>

RE: How large is your solr index?

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Erick Erickson [erickerickson@gmail.com] wrote:
> Of course I wouldn't be doing the work so I really don't have much of
> a vote, but it's not clear to me at all that enough people would actually
> have a use-case for 2b+ docs in a single shard to make it
> worthwhile. At that scale GC potentially becomes really unpleasant for
> instance....

Over the last few years we have seen a few use cases here on the mailing list. I would be very surprised if the number of such cases does not keep rising. Currently the work for a complete overhaul does not measure up to the rewards, but that is slowly changing. At the very least I find it prudent not to limit new Lucene/Solr interfaces to ints.

As for GC: Right now a lot of structures are single-array oriented (for example using a long array to represent the bits in a bitset), which might not work well with current garbage collectors. A change to higher limits also means re-thinking such approaches: if the garbage collector likes objects below a certain size, then split the arrays into pieces of that size. Likewise, iterations over structures linear in the size of the index could be threaded. These are issues even with the current 2b limitation.

- Toke Eskildsen
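
Toke's "split the arrays" suggestion can be sketched as a paged bitset: the bits live in many fixed-size long[] pages instead of one huge array, which both sidesteps the single-array size ceiling and hands the garbage collector smaller objects. This is a minimal illustration, not Lucene's actual bitset implementation:

```java
/**
 * Sketch of a paged bitset: bits are stored in fixed-size long[] pages
 * rather than one monolithic array. The page size is an arbitrary choice
 * for this illustration, and the last page is allocated at full size
 * for simplicity (a slight over-allocation).
 */
public class PagedBitSet {
    private static final int PAGE_BITS = 1 << 20;           // 1M bits per page
    private static final int LONGS_PER_PAGE = PAGE_BITS / 64;

    private final long[][] pages;

    public PagedBitSet(long numBits) {
        int numPages = (int) ((numBits + PAGE_BITS - 1) / PAGE_BITS);
        pages = new long[numPages][];
        for (int i = 0; i < numPages; i++) {
            pages[i] = new long[LONGS_PER_PAGE];
        }
    }

    public void set(long bit) {
        pages[(int) (bit / PAGE_BITS)][(int) ((bit % PAGE_BITS) / 64)]
                |= 1L << (bit & 63);
    }

    public boolean get(long bit) {
        return (pages[(int) (bit / PAGE_BITS)][(int) ((bit % PAGE_BITS) / 64)]
                & (1L << (bit & 63))) != 0;
    }
}
```

Addressing is by long throughout, so the 2b ceiling moves from the data structure to available memory.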

Re: How large is your solr index?

Posted by Erick Erickson <er...@gmail.com>.
bq: For Solr 5 why don't we switch it to 64 bit ??

-1 on this for a couple of reasons:
> It'd be pretty invasive, and 5.0 may be imminent. Far too big a change to implement at the last second.
> It's not clear that it's even useful. Once you get to that many documents, performance usually suffers.

Of course I wouldn't be doing the work so I really don't have much of
a vote, but it's not clear to me at all that enough people would
actually have a use-case for 2b+ docs in a single shard to make it
worthwhile. At that scale GC potentially becomes really unpleasant,
for instance....

FWIW,
Erick

On Sat, Jan 3, 2015 at 2:45 AM, Toke Eskildsen <te...@statsbiblioteket.dk> wrote:
> Bill Bell [billnbell@gmail.com] wrote:
>
> [solr maxdoc limit of 2b]
>
>> For Solr 5 why don't we switch it to 64 bit ??
>
> The biggest challenge for a switch is that Java's arrays can only hold 2b values. I support the idea of switching to much larger minimums throughout the code. But it is a larger fix than replacing int with long.
>
> - Toke Eskildsen

RE: How large is your solr index?

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Bill Bell [billnbell@gmail.com] wrote:

[solr maxdoc limit of 2b]

> For Solr 5 why don't we switch it to 64 bit ??

The biggest challenge for a switch is that Java's arrays can only hold 2b values. I support the idea of switching to much larger minimums throughout the code. But it is a larger fix than replacing int with long. 

- Toke Eskildsen
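
Toke's point is easy to see in code: a Java array index is an int, so any long-addressable structure has to be backed by multiple arrays. A minimal sketch (the class name and chunk size are arbitrary choices for illustration, not anything from Lucene):

```java
/**
 * Sketch of a long-indexed int array backed by multiple int[] chunks,
 * illustrating why "switch to 64 bit" is more than replacing int with
 * long: Java arrays themselves are int-indexed. The chunk size is an
 * arbitrary choice; the last chunk is trimmed to the requested size.
 */
public class BigIntArray {
    private static final int CHUNK_SIZE = 1 << 27;      // 128M ints per chunk

    private final int[][] chunks;

    public BigIntArray(long size) {
        int numChunks = (int) ((size + CHUNK_SIZE - 1) / CHUNK_SIZE);
        chunks = new int[numChunks][];
        for (int i = 0; i < numChunks; i++) {
            long remaining = size - (long) i * CHUNK_SIZE;
            chunks[i] = new int[(int) Math.min(CHUNK_SIZE, remaining)];
        }
    }

    public void set(long index, int value) {
        chunks[(int) (index / CHUNK_SIZE)][(int) (index % CHUNK_SIZE)] = value;
    }

    public int get(long index) {
        return chunks[(int) (index / CHUNK_SIZE)][(int) (index % CHUNK_SIZE)];
    }
}
```

Every access still bottoms out in int indices; the long address is just split across chunks, which is why the change would ripple through every structure that assumes one array per index.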