Posted to solr-user@lucene.apache.org by Jeffery Yuan <yu...@gmail.com> on 2016/09/23 20:58:54 UTC

Re: Whether SolrCloud can support 2 TB data?

Thanks so much for your prompt reply.

We are definitely going to use SolrCloud.

I am just wondering whether SolrCloud can scale to the terabyte level,
and what kind of hardware configuration that would require.

Thanks.



--
View this message in context: http://lucene.472066.n3.nabble.com/Whether-solr-can-support-2-TB-data-tp4297790p4297800.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Whether SolrCloud can support 2 TB data?

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
John Bickerstaff <jo...@johnbickerstaff.com> wrote:
> As an aside - I just spoke with someone the other day who is using Hadoop
> for re-indexing in order to save a lot of time.

If you control which documents go into which shards, then that is certainly a possibility. We have a collection with a long re-indexing time (about 20 CPU-core years), but we are able to build the shards independently of each other, so it scales near-perfectly with more hardware. The trick is that our documents are never updated, so everything is always new and simply appended to the latest shard being built. We don't use Hadoop, but the principle is the same.
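The append-only routing can be sketched in a few lines (an illustrative sketch, not our actual code; names are made up for the example):

```python
# Sketch of append-only shard routing: documents are never updated, so the
# i-th document ever ingested lands in a shard determined only by i, and
# every shard except the newest is sealed and can be built independently.

def route_to_shard(doc_index: int, docs_per_shard: int) -> int:
    """Return the shard number for the doc_index-th document."""
    return doc_index // docs_per_shard

# With 1M docs per shard, document 2,500,000 lands in shard 2;
# shards 0 and 1 are already sealed and never touched again.
assert route_to_shard(2_500_000, 1_000_000) == 2
```

Because sealed shards never change, each one can be (re)built on separate hardware in parallel, which is where the near-perfect scaling comes from.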

- Toke Eskildsen

Re: Whether SolrCloud can support 2 TB data?

Posted by John Bickerstaff <jo...@johnbickerstaff.com>.
As an aside - I just spoke with someone the other day who is using Hadoop
for re-indexing in order to save a lot of time.  I don't know the details, but
I assume they're using Hadoop to call Lucene code and index documents using
the map-reduce approach...

This was built in their own shop - I don't think the code is available as
open source, but it works for them as a way to really cut down re-indexing
time for extremely large data sets.


Re: Whether SolrCloud can support 2 TB data?

Posted by Erick Erickson <er...@gmail.com>.
John:

The MapReduceIndexerTool (in contrib) is intended for bulk indexing in
a Hadoop ecosystem. This doesn't preclude home-grown setups of course,
but it's available OOB. The only tricky bit is at the end: either you
have your Solr indexes on HDFS, in which case MRIT can merge them into
a live Solr cluster, or you have to copy them from HDFS to your
local-disk indexes (and, of course, get the shards right). It's a
pretty slick utility: it reads from ZooKeeper to understand the number
of shards required and does the whole map/reduce thing to distribute
the work.

As an aside, it uses EmbeddedSolrServer to do _exactly_ the same thing
as indexing to a Solr installation, reads the configs from ZK etc.

Then there's spark-solr, a way to index from Spark jobs directly into
live Solr setups. The throughput there is limited by how many docs/second
you can process on each shard, multiplied by the number of shards.

BTW, in a highly optimized-for-updates setup I've seen 1M+ docs/second
achieved. Don't try this at home, it takes quite a bit of
infrastructure....

As Yago says, adding replicas imposes a penalty; I've typically
seen 20-30% in terms of indexing throughput. You can ameliorate this
by adding more shards, but that adds other complexities.
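As a rough model of that trade-off (illustrative numbers only, using the 20-30% figure above):

```python
# Back-of-envelope indexing throughput: per-shard rate times shard count,
# discounted by an assumed replication penalty (20-30% observed above).

def cluster_throughput(per_shard_rate: float, shards: int,
                       replica_penalty: float) -> float:
    """Aggregate docs/second across shards after the replication penalty."""
    return per_shard_rate * shards * (1 - replica_penalty)

# 50K docs/s per shard, 4 shards, 25% penalty -> 150K docs/s aggregate
assert cluster_throughput(50_000, 4, 0.25) == 150_000.0
```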

But I cannot over-emphasize how much "it depends" (tm). I was setting
up a stupid-simple index where all I wanted was a bunch of docs with
exactly one simple field plus the ID. On my laptop I was seeing 50K
docs/second in a single shard.

Then for another test case I was indexing an ngrammed field (mingram=2,
maxgram=32) and was seeing < 100 docs/second. There's simply no way to
translate from the raw data size to hardware specs, unfortunately.

Best,
Erick


Re: Whether SolrCloud can support 2 TB data?

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Regarding a 12TB index:

Yago Riveiro <ya...@gmail.com> wrote:

> Our cluster is small for the data we hold (12 machines with SSDs and 32G of
> RAM), but we don't need sub-second queries; we need faceting with high
> cardinality (in worst-case scenarios we aggregate 5M unique string values)

> At peak insert load we can handle around 25K docs per second with 2 replicas,
> without compromising reads or putting a node under stress. Stressed nodes can
> eject themselves from the ZooKeeper cluster due to a GC pause or a lack of
> CPU to communicate.

I am surprised that you manage to have this working on that hardware. As you have replicas, it seems to me that you handle 2*12TB of index with 12*32GB of RAM? This is very close to our setup (22TB of index with 320GB of RAM (updated last week from 256GB) per machine), but we benefit hugely from having a static index.
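Spelling out that arithmetic (numbers taken from this thread; a sketch, not a sizing formula):

```python
# Index-to-RAM ratio: total index size across all replicas versus total
# RAM in the cluster. Higher means less of the index fits in page cache.

def index_to_ram_ratio(index_tb: float, replicas: int,
                       machines: int, ram_gb_each: float) -> float:
    total_index_gb = index_tb * 1024 * replicas
    total_ram_gb = machines * ram_gb_each
    return total_index_gb / total_ram_gb

# Yago: 12 TB * 2 replicas on 12 machines with 32 GB each
# -> 64 bytes of index per byte of RAM.
assert index_to_ram_ratio(12, 2, 12, 32) == 64.0
```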

I assume the SSDs are local? How much memory do you use for heap on each machine?

- Toke Eskildsen

Re: Whether SolrCloud can support 2 TB data?

Posted by Yago Riveiro <ya...@gmail.com>.
"LucidWorks achieved 150k docs/second"

This is only valid if you don't have replication. I don't know your use case,
but a realistic use case normally uses some type of redundancy so that data
is not lost on a hardware failure: at least 2 replicas, and more replicas
imply a reduction in throughput. Also don't forget that in a realistic use
case you have to handle reads too.

Our cluster is small for the data we hold (12 machines with SSDs and 32G of
RAM), but we don't need sub-second queries; we need faceting with high
cardinality (in worst-case scenarios we aggregate 5M unique string values).

As Shawn probably told you, sizing your cluster is a trial-and-error path. Our
cluster is optimized to handle a low rate of reads, facet queries, and a high
rate of inserts.

At peak insert load we can handle around 25K docs per second with 2 replicas,
without compromising reads or putting a node under stress. Stressed nodes can
eject themselves from the ZooKeeper cluster due to a GC pause or a lack of
CPU to communicate.

If you want accurate data, you need to test.

Keep in mind the most important thing about Solr, in my opinion: at terabyte
scale, any field type schema change or Lucene codec change will force you to
do a full reindex. Each time I need to update Solr to a major release, it's a
pain to convert the segments if they are not compatible with the newer
version. This can take months, it will not ensure your data is equal to a
cleanly built index (voodoo-magic things can happen, trust me), and it will
drain a huge amount of hardware resources to do it without downtime.
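A naive way to see the replication cost (a sketch, not a benchmark): every document is indexed once per replica, so a no-replication benchmark rate is at best an upper bound.

```python
# Naive bound: with replication_factor copies, each doc is indexed that
# many times, so divide a no-replication benchmark rate accordingly.

def effective_rate(benchmark_rate: float, replication_factor: int) -> float:
    return benchmark_rate / replication_factor

# The 150K docs/s figure with 2 replicas -> at most ~75K docs/s
assert effective_rate(150_000, 2) == 75_000.0
```

Real clusters do worse than this bound, since replicas also compete for CPU, network, and disk while serving reads.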

  
--

  

/Yago Riveiro


  


Re: Whether SolrCloud can support 2 TB data?

Posted by S G <sg...@gmail.com>.
Hey Yago,

12 T is very impressive.

Can you also share some numbers about the shards, replicas, machine
count/specs and docs/second for your case?
I assume you are not running a single 12 TB index, so some
details on that would be really helpful too.

https://lucidworks.com/blog/2014/06/03/introducing-the-solr-scale-toolkit/
is a good post on how LucidWorks achieved 150k docs/second.
If you have any similar blog post, that would be quite useful and popular
too.

--SG


Re: Whether SolrCloud can support 2 TB data?

Posted by Yago Riveiro <ya...@gmail.com>.
In my company we have a SolrCloud cluster with 12T.

My advice:

Be nice to the CPU; you will need it at some point (very important if you have no control over the kind of queries sent to the cluster: clients are greedy, they want all results at the same time).

SSDs and memory (as much as you can afford if you will do faceting).

Full recoveries are a pain; the network is important and should be as fast as possible, never less than 1 Gbit.

Divide and conquer, but too much division leads to expensive overhead, since data travels over the network. Find the sweet spot (you will only know by testing your use case).

--

/Yago Riveiro


Re: Whether SolrCloud can support 2 TB data?

Posted by Pushkar Raste <pu...@gmail.com>.
Solr is RAM hungry. Make sure that you have enough RAM to hold most of
a core's index in RAM itself.

You should also consider using really good SSDs.

That would be a good start. Like others said, test and verify your setup.
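A quick sanity check for that advice (hypothetical numbers; the heap size is whatever you give the JVM):

```python
# Fraction of a core's index that fits in the OS page cache, i.e. the RAM
# left over after the JVM heap. Well below 1.0, queries start hitting disk.

def cache_fraction(ram_gb: float, heap_gb: float, core_index_gb: float) -> float:
    page_cache_gb = max(ram_gb - heap_gb, 0.0)
    return min(page_cache_gb / core_index_gb, 1.0)

# 64 GB box, 8 GB heap, 100 GB core index -> only 56% of the index cached
assert cache_fraction(64, 8, 100) == 0.56
```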

--Pushkar Raste
