Posted to solr-user@lucene.apache.org by "Buttler, David" <bu...@llnl.gov> on 2012/11/16 00:04:05 UTC

cores shards and disks in SolrCloud

Hi,
I have a question about the optimal way to distribute Solr indexes across a cloud.  I have a small number of collections (fewer than 10) and a small cluster (6 nodes), but each node has several disks, 5 of which I am using for my Solr indexes.  The cluster is also a Hadoop cluster, so the disks are not RAIDed; they are JBOD.  So, on my 5 slave nodes, each with 5 disks, I was thinking of putting one shard of each collection on each disk.  That means I end up with 25 shards per collection; with 10 collections, that would be 250 shards total.  Given that Solr 4 supports multiple cores per instance, my first thought was to run one JVM per node: with 10 collections and 5 shards of each per node, each JVM would host 50 shards.

So, I set up my first collection, with a modest 20M documents, and everything seems to work fine.  But the subsequent collections I have added are having issues.  The first is that every time I query for the document count (q=*:* with rows=0), a different number of documents is returned; the count can differ by as much as 10%.  Yet if I query each shard individually (setting distrib=false), the number returned is always consistent.
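To be concrete, the check I am doing looks roughly like the SolrJ sketch below (the ZooKeeper hosts, collection name, and core URLs are placeholders for my real layout):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class CountCheck {
    public static void main(String[] args) throws Exception {
        // Distributed count across the whole collection, routed via ZooKeeper.
        CloudSolrServer cloud = new CloudSolrServer("zkhost1:2181,zkhost2:2181,zkhost3:2181");
        cloud.setDefaultCollection("collection2");
        SolrQuery all = new SolrQuery("*:*");
        all.setRows(0);
        System.out.println("distributed numFound = "
                + cloud.query(all).getResults().getNumFound());

        // Per-shard counts with distrib=false; these stay consistent from run to run.
        String[] coreUrls = {
                "http://node1:8983/solr/collection2_shard1_replica1",
                "http://node2:8983/solr/collection2_shard2_replica1" };
        long sum = 0;
        for (String url : coreUrls) {
            SolrQuery local = new SolrQuery("*:*");
            local.setRows(0);
            local.set("distrib", "false");
            long n = new HttpSolrServer(url).query(local).getResults().getNumFound();
            System.out.println(url + " numFound = " + n);
            sum += n;
        }
        System.out.println("sum of per-shard counts = " + sum);
    }
}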

I am not entirely sure this is related, as I may have missed a step in my setup of the subsequent collections (bootstrapping the config).

But, more related to the architecture question: is it better to have one JVM per disk, one JVM per shard, or one JVM per node?  Given that the indexes are memory-mapped (MMapDirectory), how does memory play into the question?  There is a blog post (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html) that recommends minimizing the JVM heap and maximizing the OS-level file cache, but how does that impact sorting / boosting?

Sorry if I have missed some documentation: I have been through the SolrCloud tutorial a couple of times and didn't see any discussion of these issues.

Thanks,
Dave

Re: cores shards and disks in SolrCloud

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

I think here you want a single JVM per server - no need for multiple
JVMs, a JVM per collection, and the like.
If you can spread data over more than one disk on each of your servers,
great; that will help.
Re data loss - yes, you really should just be using replication.  Sharding
heavily will minimize how much data you lose if you have no replication,
but it could actually decrease performance.  Also, if you have lots of
small shards, say 250 across only 5 servers, and one server dies, doesn't
that mean you lose 50 shards - 20% of your data?
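Something along these lines should do it - a rough SolrJ sketch against the Collections API (the host, collection, and config names are placeholders, and I am going from memory on the exact parameter names, so double-check them):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class CreateReplicatedCollection {
    public static void main(String[] args) throws Exception {
        // Any live node can receive Collections API calls.
        HttpSolrServer node = new HttpSolrServer("http://node1:8983/solr");

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("action", "CREATE");
        params.set("name", "collection2");
        params.set("numShards", 5);
        params.set("replicationFactor", 2);            // two copies of every shard in total
        params.set("collection.configName", "myconf"); // config already uploaded to ZooKeeper

        QueryRequest request = new QueryRequest(params);
        request.setPath("/admin/collections");
        node.request(request);
    }
}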

Otis
--
Solr Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html

On Thu, Nov 15, 2012 at 8:18 PM, Buttler, David <bu...@llnl.gov> wrote:

> The main reason to split a collection into 25 shards is to reduce the
> impact of the loss of a disk.  I was running an older version of solr, a
> disk went down, and my entire collection was offline.  Solr 4 offers
> shards.tolerant to reduce the impact of the loss of a disk: fewer documents
> will be returned.  Obviously, I could replicate the data so that I wouldn't
> lose any documents while I replace my disk, but since I am already storing
> the original data in HDFS, (with a 3x replication), adding additional
> replication for solr eats into my disk budget a bit too much.
>
> Also, my other collections have larger amounts of data / number of
> documents. For every TB of raw data, how much disk space do I want to be
> using? As little as possible.  Drives are cheap, but not free.  And, nodes
> only hold so many drives.
>
> Dave

RE: cores shards and disks in SolrCloud

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Fri, 2012-11-16 at 02:18 +0100, Buttler, David wrote:
> Obviously, I could replicate the data so
> that I wouldn't lose any documents while I replace my disk, but since I
> am already storing the original data in HDFS, (with a 3x replication),
> adding additional replication for solr eats into my disk budget a bit
> too much.

Nevertheless, limiting damage by excessive sharding is a very peculiar
decision. How many bytes are we talking about here? Do you have multi-TB
indexes?


RE: cores shards and disks in SolrCloud

Posted by "Buttler, David" <bu...@llnl.gov>.
The main reason to split a collection into 25 shards is to reduce the impact of the loss of a disk.  I was running an older version of solr, a disk went down, and my entire collection was offline.  Solr 4 offers shards.tolerant to reduce the impact of the loss of a disk: fewer documents will be returned.  Obviously, I could replicate the data so that I wouldn't lose any documents while I replace my disk, but since I am already storing the original data in HDFS, (with a 3x replication), adding additional replication for solr eats into my disk budget a bit too much.
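For reference, the degraded mode I am relying on is just a request parameter; a minimal SolrJ sketch (ZooKeeper hosts and collection name are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;

public class TolerantCount {
    public static void main(String[] args) throws Exception {
        CloudSolrServer cloud = new CloudSolrServer("zkhost1:2181,zkhost2:2181,zkhost3:2181");
        cloud.setDefaultCollection("collection1");

        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);
        // Ask the distributed query to return whatever the reachable shards have
        // instead of failing the whole request when one shard (disk) is down.
        q.set("shards.tolerant", "true");

        System.out.println("numFound (possibly partial) = "
                + cloud.query(q).getResults().getNumFound());
    }
}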

Also, my other collections have larger amounts of data / number of documents. For every TB of raw data, how much disk space do I want to be using? As little as possible.  Drives are cheap, but not free.  And, nodes only hold so many drives.  

Dave

-----Original Message-----
From: Upayavira [mailto:uv@odoko.co.uk] 
Sent: Thursday, November 15, 2012 4:37 PM
To: solr-user@lucene.apache.org
Subject: Re: cores shards and disks in SolrCloud

Personally I see no benefit in having more than one JVM per node; cores
can handle it. I would say that splitting a 20M-document index into 25
shards strikes me as serious overkill, unless you expect to expand
significantly. 20M documents would likely be okay with two or three
shards. You can store the indexes for each core on different disks,
which can give some performance benefit.

Just some thoughts.

Upayavira

Re: cores shards and disks in SolrCloud

Posted by Upayavira <uv...@odoko.co.uk>.
Personally I see no benefit in having more than one JVM per node; cores
can handle it. I would say that splitting a 20M-document index into 25
shards strikes me as serious overkill, unless you expect to expand
significantly. 20M documents would likely be okay with two or three
shards. You can store the indexes for each core on different disks,
which can give some performance benefit.
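If you go that route, one way to pin a core to a particular disk is to pass an explicit dataDir when creating the core through the CoreAdmin API. A rough, untested SolrJ sketch with made-up host, paths, and names:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class CreateCoreOnDisk {
    public static void main(String[] args) throws Exception {
        // CoreAdmin requests go to the node that should host the core.
        HttpSolrServer node = new HttpSolrServer("http://node1:8983/solr");

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("action", "CREATE");
        params.set("name", "collection1_shard1");
        params.set("collection", "collection1");
        params.set("shard", "shard1");
        params.set("instanceDir", "collection1_shard1");
        params.set("dataDir", "/data/disk3/solr/collection1_shard1"); // index lives on disk 3

        QueryRequest request = new QueryRequest(params);
        request.setPath("/admin/cores");
        node.request(request);
    }
}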

Just some thoughts.

Upayavira