Posted to solr-user@lucene.apache.org by Susheel Kumar <su...@thedigitalgroup.net> on 2014/01/24 19:10:10 UTC

Solr server requirements for 100+ million documents

Hi,

Currently we are indexing 10 million documents from a database (10 DB data entities), and the index size is around 8 GB on a Windows virtual box. Indexing in one shot takes 12+ hours, while indexing in parallel in separate cores and merging them together takes 4+ hours.

We are looking to scale to 100+ million documents and would like recommendations on server requirements for a production environment, based on the parameters below. There can be 200+ users performing searches at the same time.

No. of physical servers (considering SolrCloud)
Memory requirement
Processor requirement (# of cores)
Linux as OS, as opposed to Windows

Thanks in advance. 
Susheel


Re: Solr server requirements for 100+ million documents

Posted by Shawn Heisey <so...@elyograg.org>.
On 2/11/2014 3:28 PM, Susheel Kumar wrote:
> Thanks, Otis, for the quick reply. So for ZK, do you recommend separate servers, and if so, how many for an initial SolrCloud cluster setup?

In a minimal 3-server setup, all three servers would run zookeeper and two of 
them would also run Solr. With this setup, you can survive the failure of 
any one of those three machines, even if it dies completely.

If the third machine is only running zookeeper, two fast CPU cores and 
2GB of RAM would be plenty.  For 100 million documents, I would 
personally recommend at least 8 CPU cores on the machines running Solr, 
ideally provided by at least two separate physical CPUs.  Otis 
recommended 32GB of RAM as a starting point.  You would very likely want 
more.

One copy of my 90 million document index uses two servers to run all the 
shards.  Because I have two copies of the index, I have four servers.  
Each server has 64GB of RAM.  This is **NOT** running SolrCloud, but if 
it were, I would have zookeeper running on three of those servers.

Thanks,
Shawn
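
To make Shawn's topology concrete: in SolrCloud the client talks to the
ZooKeeper ensemble rather than to a single Solr node, so all three ZK hosts
go into the connect string. A minimal SolrJ sketch, assuming Solr 4.x class
names (CloudSolrServer, renamed CloudSolrClient in 5.x) and hypothetical
hostnames and collection name:

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudConnectExample {
    public static void main(String[] args) throws Exception {
        // All three machines run zookeeper, so list all of them.
        CloudSolrServer server =
            new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        server.add(doc);   // routed to the right shard leader via ZK cluster state
        server.commit();
        server.shutdown();
    }
}

If one ZK node dies, the client keeps working off the surviving majority.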


Re: Solr server requirements for 100+ million documents

Posted by Jason Hellman <jh...@innoventsolutions.com>.
Whether you use the same machines as Solr or separate machines is largely a matter of taste.

If you are the CTO, then you should make this decision.  If not, inform management that risk conditions are greater when you share function and control on a single piece of hardware.  A single failure of a replica + zookeeper node will be more impactful than a single failure of a replica *or* a zookeeper node.  Let them earn the big bucks to make the risk decision.

The good news is, zookeeper hardware can be extremely lightweight for Solr Cloud.  Commodity hardware should work just fine…and thus scaling to 5 nodes for zookeeper is not that hard at all.

Jason


On Feb 11, 2014, at 3:00 PM, svante karlsson <sa...@csi.se> wrote:

> ZK needs a quorum to stay functional, so 3 servers handle one failure and 5
> handle 2 node failures. If you run Solr with 1 replica per shard, then stick to
> 3 ZK. If you use 2 replicas, use 5 ZK.


Re: Solr server requirements for 100+ million documents

Posted by svante karlsson <sa...@csi.se>.
ZK needs a quorum to stay functional, so 3 servers handle one failure and 5
handle 2 node failures. If you run Solr with 1 replica per shard, then stick to
3 ZK. If you use 2 replicas, use 5 ZK.
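
The arithmetic behind those numbers: a ZooKeeper ensemble needs a strict
majority of nodes alive, so N nodes tolerate (N - 1) / 2 failures. A trivial
sketch of the calculation:

public class ZkQuorum {
    public static void main(String[] args) {
        // Quorum is a strict majority: N/2 + 1 nodes must be up, so the
        // ensemble tolerates (N - 1) / 2 failures: 3 -> 1, 5 -> 2, 7 -> 3.
        for (int n : new int[] {1, 3, 5, 7}) {
            int quorum = n / 2 + 1;
            int tolerated = (n - 1) / 2;
            System.out.printf("%d nodes: quorum=%d, tolerated failures=%d%n",
                              n, quorum, tolerated);
        }
    }
}

Note that even counts buy nothing: 4 nodes still tolerate only 1 failure,
which is why ensembles are sized 3, 5, 7.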






RE: Solr server requirements for 100+ million documents

Posted by Susheel Kumar <su...@thedigitalgroup.net>.
Thanks, Otis, for the quick reply. So for ZK, do you recommend separate servers, and if so, how many for an initial SolrCloud cluster setup?


Re: Solr server requirements for 100+ million documents

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi Susheel,

No, we wouldn't want to go with just 1 ZK. :)

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/



RE: Solr server requirements for 100+ million documents

Posted by Susheel Kumar <su...@thedigitalgroup.net>.
Hi Otis,

Just to confirm, the 3 servers you mean here are 2 for shards/nodes and 1 for Zookeeper. Is that correct?

Thanks,
Susheel


Re: Solr server requirements for 100+ million documents

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi Susheel,

Like Erick said, it's impossible to give precise recommendations, but
making a few assumptions and combining them with experience (+ a licked
finger in the air):
* 3 servers
* 32 GB
* 2+ CPU cores
* Linux

Assuming docs are not bigger than a few KB, that they are not being
reindexed over and over, that you don't have a search rate higher than a
few dozen QPS, that your queries are not a page long, etc., and assuming
best practices are followed, the above should be sufficient.

I hope this helps.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/



Re: Solr server requirements for 100+ million documents

Posted by svante karlsson <sa...@csi.se>.
You are of course right, but we do our own normalization (among other things,
"to_lower") before we insert and before search queries get entered.

We do not use wildcards in searches either, so in our problem domain it
works quite well.

/svante





Re: Solr server requirements for 100+ million documents

Posted by Erick Erickson <er...@gmail.com>.
Hmmm, I'm always suspicious when I see a schema.xml with a lot of "string"
types. This is tangential to your question, but I thought I'd butt in anyway.

String types are totally unanalyzed. So if the input for a field is
"I like Strings", the only match will be "I like Strings". "I like strings"
won't match due to the lower-case 's' in "strings", and "like" won't match
since it isn't the complete input.

You may already know this, but thought I'd point it out. For tokenized
searches, text_general is a good place to start. Pardon me if this is repeating
what you already know....

Lots of string types sometimes lead people with DB backgrounds to
search for *like*, which will be slow, FWIW.

Best,
Erick
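
To see the difference concretely, you can run a field value through an
analyzer and print the resulting terms. A sketch using Lucene directly
(assuming Lucene 5+, where StandardAnalyzer takes no version argument;
StandardAnalyzer is only a rough stand-in for what a tokenized type like
text_general does at index time):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisDemo {
    public static void main(String[] args) throws Exception {
        // A "string" (StrField) field is unanalyzed: "I like Strings" is indexed
        // as the single term [I like Strings], so only that exact value matches.
        // A tokenized analyzer splits and lower-cases, so the query "strings" matches.
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream ts = analyzer.tokenStream("f", new StringReader("I like Strings"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());   // prints: i, like, strings
        }
        ts.end();
        ts.close();
        analyzer.close();
    }
}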


Re: Solr server requirements for 100+ million documents

Posted by svante karlsson <sa...@csi.se>.
That got away a little early...

The inserter is a small C++ program that uses pglib to speak to postgres
and an HTTP client library that uses libcurl under the hood. The
inserter draws very little CPU, and we normally use 2 writer threads that
each post 1000 records at a time. It's very inefficient to post one at a
time, but I've not done any specific testing to know whether 1000 is better
than 500....

What we're doing now is trying to figure out how to get the query
performance up, since it's not where we need it to be, so we're not done
either...
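
svante's inserter is C++ over libcurl, but the same pattern (two writer
threads, 1000 docs per request) can be sketched in SolrJ. Class names are
from Solr 4.x; the URL, field names, and row counts are placeholders:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    static final int BATCH_SIZE = 1000;   // "just an arbitrary number", as above

    public static void main(String[] args) throws Exception {
        final HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/collection1");
        ExecutorService pool = Executors.newFixedThreadPool(2);   // 2 writer threads

        for (int t = 0; t < 2; t++) {
            final int writer = t;
            pool.submit(() -> {
                List<SolrInputDocument> batch = new ArrayList<>();
                for (int i = 0; i < 100000; i++) {        // stand-in for DB rows
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", "w" + writer + "-" + i);
                    doc.addField("fieldA", "value-" + i);
                    batch.add(doc);
                    if (batch.size() == BATCH_SIZE) {
                        server.add(batch);   // one HTTP request per 1000 docs
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) server.add(batch);
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        server.commit();
        server.shutdown();
    }
}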



Re: Solr server requirements for 100+ million documents

Posted by svante karlsson <sa...@csi.se>.
We are using a postgres server on a different host (same hardware as the
test solr server). The reason we take the data from the postgres server is
that it is easy to automate testing, since we use the same server to produce
queries. In production we preload the solr from a csv file from a hive
(hadoop) job and then only write updates ( < 500 / sec ). In our use case we
use solr as a NoSQL database, since we really want to do SHOULD queries against
all the fields. The fields are typically very small text fields (<30 chars),
occasionally bigger, but I don't think I have more than 128 chars in anything
in the whole dataset.

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.1">
  <types>
    <fieldType name="uuid" class="solr.UUIDField" indexed="true" />
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
    <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
  </types>
  <fields>
    <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="name" type="int" indexed="true" stored="true"/>
    <field name="fieldA" type="string" indexed="true" stored="true"/>
    <field name="fieldB" type="string" indexed="true" stored="true"/>
    <field name="fieldC" type="int" indexed="true" stored="true"/>
    <field name="fieldD" type="int" indexed="true" stored="true"/>
    <field name="fieldE" type="int" indexed="true" stored="true"/>
    <field name="fieldF" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="fieldG" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="fieldH" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="fieldI" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="fieldJ" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="fieldK" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="fieldL" type="string" indexed="true" stored="true"/>
    <field name="fieldM" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="fieldN" type="string" indexed="true" stored="true"/>

    <field name="fieldO" type="string" indexed="false" stored="true" required="false" />
    <field name="ts" type="long" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <solrQueryParser defaultOperator="OR"/>
</schema>
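
With defaultOperator="OR", every clause in a query is effectively a SHOULD
clause: a document matching any field clause is returned, ranked by how many
clauses it matches. A sketch of such a query in SolrJ (Solr 4.x class names;
the URL and search term are illustrative, the field names come from the
schema above, and since these are unanalyzed string fields the values must
already be normalized, as svante notes):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShouldQueryExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Three optional (SHOULD) clauses; a doc matching any one is a hit.
        SolrQuery q = new SolrQuery("fieldA:foo fieldB:foo fieldL:foo");
        q.setRows(10);

        QueryResponse rsp = server.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
        server.shutdown();
    }
}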





2014/1/25 Kranti Parisa <kr...@gmail.com>

> can you post the complete solrconfig.xml file and schema.xml files to
> review all of your settings that would impact your indexing performance.
>
> Thanks,
> Kranti K. Parisa
> http://www.linkedin.com/in/krantiparisa
>
>
>
> On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar <
> susheel.kumar@thedigitalgroup.net> wrote:
>
> > Thanks, Svante. Your indexing speed using the db seems to be really fast.
> > Can you please provide some more detail on how you are indexing db records.
> > Is it thru DataImportHandler? And what database? Is that local db? We are
> > indexing around 70 fields (60 multivalued) but data is not populated always
> > in all fields. The average size of document is in 5-10 kbs.
> >
> > -----Original Message-----
> > From: saka.csi.se@gmail.com [mailto:saka.csi.se@gmail.com] On Behalf Of
> > svante karlsson
> > Sent: Friday, January 24, 2014 5:05 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr server requirements for 100+ million documents
> >
> > I just indexed 100 million db docs (records) with 22 fields (4
> > multivalued) in 9524 sec using libcurl.
> > 11 million took 763 seconds so the speed drops somewhat with increasing
> > dbsize.
> >
> > We write 1000 docs (just an arbitrary number) in each request from two
> > threads. If you will be using solrcloud, you will want more writer threads.
> >
> > The hardware is a single cheap HP DL320E GEN8 V2 1P E3-1220V3 with one SSD
> > and 32GB, and the solr runs on ubuntu 13.10 inside an ESXi virtual machine.
> >
> > /svante
> >
> >
> >
> >
> > 2014/1/24 Susheel Kumar <su...@thedigitalgroup.net>
> >
> > > Thanks, Erick for the info.
> > >
> > > For indexing, I agree most of the time is consumed in data acquisition,
> > > which in our case is from the database. For indexing we are currently using
> > > the manual process, i.e. the Solr dashboard Data Import, but are now looking
> > > to automate. How do you suggest we automate the indexing part? Do you
> > > recommend using SolrJ, or should we try to automate using curl?
> > >
> > >
> > > -----Original Message-----
> > > From: Erick Erickson [mailto:erickerickson@gmail.com]
> > > Sent: Friday, January 24, 2014 2:59 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Solr server requirements for 100+ million documents
> > >
> > > Can't be done with the information you provided, and can only be
> > > guessed at even with more comprehensive information.
> > >
> > > Here's why:
> > >
> > >
> > > http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> > >
> > > Also, at a guess, your indexing speed is so slow due to data
> > > acquisition; I rather doubt you're being limited by raw Solr indexing.
> > > If you're using SolrJ, try commenting out the
> > > server.add() bit and running again. My guess is that your indexing
> > > speed will be almost unchanged, in which case the data acquisition
> > > process is where you should concentrate efforts. As a
> > > comparison, I can index 11M Wikipedia docs on my laptop in 45 minutes
> > > without any attempts at parallelization.
> > >
> > >
> > > Best,
> > > Erick
> > >

Re: Solr server requirements for 100+ million documents

Posted by simon <mt...@gmail.com>.
Erick's probably too modest to say so ;=), but he wrote a great blog entry
on indexing with SolrJ -
http://searchhub.org/2012/02/14/indexing-with-solrj/ . I took the guts of
the code in that blog and easily customized it to write a very fast
indexer (content from MySQL; I excised all the Tika code as I am not using
it).

You should replace StreamingUpdateSolrServer with ConcurrentUpdateSolrServer
and experiment to find the optimal number of threads to configure.

-Simon
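
A minimal sketch of simon's suggestion, assuming Solr 4.x SolrJ.
ConcurrentUpdateSolrServer buffers adds in an internal queue and streams
them to Solr from a pool of background threads; the URL, queue size, and
thread count below are placeholders to experiment with:

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ConcurrentIndexer {
    public static void main(String[] args) throws Exception {
        // (url, queueSize, threadCount): buffered docs are flushed to Solr
        // by 4 background threads.
        ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
                "http://localhost:8983/solr/collection1", 10000, 4);

        for (int i = 0; i < 1000000; i++) {       // e.g. rows read from MySQL
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            server.add(doc);   // returns quickly; real I/O happens in background
        }
        server.blockUntilFinished();   // drain the queue
        server.commit();
        server.shutdown();
    }
}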



Re: Solr server requirements for 100+ million documents

Posted by Erick Erickson <er...@gmail.com>.
1> That's what I'd do. For incremental updates you might have to
create a trigger on the main table and insert rows into another table
that is then used to do the incremental updates. This is particularly
relevant for deletes. Consider the case where you've ingested all your
data and then rows are deleted. Removing those same documents from Solr
requires either a> re-indexing everything, or b> getting all the docs
in Solr and comparing them with the rows in the DB, etc., which is
expensive, or c> recording the changes as above and just processing
deletes from the "change table" (see the sketch below).

2> SolrJ is usually the most current. I don't know how much work
SolrNet gets. However, under the covers it's just HTTP calls, so since
you have access in either to just adding HTTP parameters, you should
be able to get the full functionality out of either. I _think_ that
I'd go with whatever you're most comfortable with.

Best,
Erick
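
A sketch of the delete half of option c>, assuming a hypothetical change
table (doc_changes) that a database trigger fills with the ids of deleted
rows; plain JDBC plus SolrJ with Solr 4.x class names, and the JDBC URL and
credentials are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class DeleteSync {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/collection1");

        // doc_changes is hypothetical: a trigger on the main table records the
        // id of every deleted row, so Solr never has to be diffed against the
        // whole database.
        try (Connection db = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
             Statement st = db.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT doc_id FROM doc_changes WHERE change_type = 'DELETE'")) {
            List<String> ids = new ArrayList<>();
            while (rs.next()) {
                ids.add(rs.getString("doc_id"));
            }
            if (!ids.isEmpty()) {
                server.deleteById(ids);   // then clear the processed change rows
                server.commit();
            }
        }
        server.shutdown();
    }
}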

On Sun, Jan 26, 2014 at 9:54 AM, Susheel Kumar
<su...@thedigitalgroup.net> wrote:
> Thank you Erick for your valuable inputs. Yes, we have to re-index data again & again. I'll look into the possibility of tuning db access.
>
> On SolrJ and automating the indexing (incremental as well as one-time), I want to get your opinion on the two points below. We will be indexing separate sets of tables with similar data structures.
>
> - Should we use SolrJ and write Java programs that can be scheduled to trigger indexing on demand or on a schedule?
>
> - Is using SolrJ a better idea even for searching than using SolrNet? Our frontend is in .Net, so we started using SolrNet, but I am afraid that down the road, when we scale to/support SolrCloud, SolrJ will be better?
>
>
> Thanks
> Susheel
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Sunday, January 26, 2014 8:37 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr server requirements for 100+ million documents
>
> Dumping the raw data would probably be a good idea. I guarantee you'll be re-indexing the data several times as you change the schema to accommodate different requirements...
>
> But it may also be worth spending some time figuring out why the DB access is slow. Sometimes one can tune that.
>
> If you go the SolrJ route, you also have the possibility of setting up N clients to work simultaneously, sometimes that'll help.
>
> FWIW,
> Erick
>
> On Sat, Jan 25, 2014 at 11:06 PM, Susheel Kumar <su...@thedigitalgroup.net> wrote:
>> Hi Kranti,
>>
>> Attached are the solrconfig & schema xml for review. I did run indexing with just a few fields (5-6 fields) in schema.xml, keeping the same db config, but indexing still takes almost the same time (on average, 1 million records per hour), which confirms that the bottleneck is in the data acquisition, which in our case is the oracle database. I am thinking of not using dataimporthandler / jdbc to get data from Oracle, but rather dumping data somehow from oracle using SQL*Loader and then indexing it. Any thoughts?
>>
>> Thnx

Re: Solr server requirements for 100+ million documents

Posted by Jorge Luis Betancourt Gonzalez <jl...@uci.cu>.
A hardware-sizing spreadsheet has been mentioned previously on this list. Since you already have documents in an index, you could extract the needed numbers from your index and feed them into the spreadsheet; it should give you a rough approximation of the hardware you'll be needing. Also, if I'm not mistaken, no SolrCloud approximation is provided by this “tool”.

Greetings!

RE: Solr server requirements for 100+ million documents

Posted by Susheel Kumar <su...@thedigitalgroup.net>.
Thanks, Jack. That helps.

Re: Solr server requirements for 100+ million documents

Posted by Jack Krupansky <ja...@basetechnology.com>.
Lucene and Solr work best if the full index can be cached in OS memory.
Sure, Lucene/Solr still works properly once the index no longer fits, but
performance will drop off.

I would say that you could fit 100 million moderate-size documents on a
single Solr server - provided that you give the OS enough RAM for the full
Lucene index. That said, if you want to configure a SolrCloud cluster with
shards, you can use more modest, commodity servers with less RAM, provided
each server still fits its fraction of the total Lucene index in that
server's OS memory (file cache).

You may also need to add replicas for each shard to accommodate query load -
proof-of-concept testing is needed to verify that. It is worth noting that
sharding can improve total query performance, since each node only searches
a fraction of the total data and those searches are done in parallel (since
they are on different machines).

-- Jack Krupansky

RE: Solr server requirements for 100+ million documents

Posted by Susheel Kumar <su...@thedigitalgroup.net>.
Thank you Erick for your valuable inputs. Yes, we have to re-index data again & again. I'll look into the possibility of tuning db access.

On SolrJ and automating the indexing (incremental as well as one-time), I want to get your opinion on the two points below. We will be indexing separate sets of tables with similar data structures.

- Should we use SolrJ and write Java programs that can be scheduled to trigger indexing on demand or on a schedule?

- Is SolrJ a better idea even for searching than SolrNet? Our frontend is in .NET, so we started using SolrNet, but I am afraid that down the road, when we scale out and support SolrCloud, SolrJ will be better.

Thanks
Susheel

Re: Solr server requirements for 100+ million documents

Posted by Erick Erickson <er...@gmail.com>.
Dumping the raw data would probably be a good idea. I guarantee you'll be
re-indexing the data several times as you change the schema to accommodate
different requirements...
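
As a sketch of the dump-then-index route, once the table has been
exported to CSV (the path below is made up; "/update/csv" is the CSV
handler in the stock Solr 4 example solrconfig.xml - on newer configs
you'd post to /update with the right content type):

import java.io.File;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class CsvIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr =
            new HttpSolrServer("http://localhost:8983/solr/collection1");
        // stream the whole dump file to the CSV update handler
        ContentStreamUpdateRequest req =
            new ContentStreamUpdateRequest("/update/csv");
        req.addFile(new File("/data/dump/docs.csv"), "text/csv");
        req.setParam("commit", "true");
        solr.request(req);
        solr.shutdown();
    }
}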

But it may also be worth spending some time figuring out why the DB
access is slow. Sometimes one can tune that.

If you go the SolrJ route, you also have the possibility of setting up
N clients to work simultaneously; sometimes that'll help.
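
One hedged sketch of that: SolrJ's ConcurrentUpdateSolrServer streams
adds from a pool of background threads, so you get N writers without
managing the threads yourself (the URL, queue size, and thread count
below are placeholders to tune):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        // 10000-doc queue drained by 4 background writer threads
        ConcurrentUpdateSolrServer solr = new ConcurrentUpdateSolrServer(
            "http://localhost:8983/solr/collection1", 10000, 4);
        for (int i = 0; i < 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title_t", "document " + i);
            solr.add(doc);          // returns quickly; background threads send it
        }
        solr.blockUntilFinished();  // drain the queue
        solr.commit();
        solr.shutdown();
    }
}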

FWIW,
Erick

RE: Solr server requirements for 100+ million documents

Posted by Susheel Kumar <su...@thedigitalgroup.net>.
Hi Kranti,

Attached are the solrconfig & schema XML for review. I did run indexing with just a few fields (5-6 fields) in schema.xml while keeping the same db config, but indexing still takes about the same time (on average, 1 million records per hour), which confirms that the bottleneck is the data acquisition, which in our case is the Oracle database. I am thinking of not using DataImportHandler/JDBC to get data from Oracle, but rather dumping the data from Oracle somehow (e.g. with SQL loader) and then indexing the dump. Any thoughts?

Thnx

Re: Solr server requirements for 100+ million documents

Posted by Kranti Parisa <kr...@gmail.com>.
Can you post the complete solrconfig.xml and schema.xml files, so we can
review all of your settings that would impact your indexing performance?

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa



RE: Solr server requirements for 100+ million documents

Posted by Susheel Kumar <su...@thedigitalgroup.net>.
Thanks, Svante. Your indexing speed using a db seems really fast. Can you please provide some more detail on how you are indexing db records? Is it through DataImportHandler? And what database? Is it a local db? We are indexing around 70 fields (60 multivalued), but data is not always populated in all fields. The average document size is 5-10 KB.

Re: Solr server requirements for 100+ million documents

Posted by svante karlsson <sa...@csi.se>.
I just indexed 100 million db docs (records) with 22 fields (4 multivalued)
in 9524 sec using libcurl.
11 million took 763 seconds, so the speed drops somewhat with increasing
db size.

We write 1000 docs (just an arbitrary number) in each request, from two
threads. If you will be using SolrCloud you will want more writer threads.

The hardware is a single cheap HP DL320e Gen8 v2 1P E3-1220v3 with one SSD
and 32GB, and Solr runs on Ubuntu 13.10 inside an ESXi virtual machine.

/svante
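
For anyone who'd rather stay in Java, a rough SolrJ equivalent of that
libcurl batching (1000 is still just an arbitrary number, and the URL
and field names are placeholders):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedWriter {
    private static final int BATCH_SIZE = 1000;

    public static void main(String[] args) throws Exception {
        HttpSolrServer solr =
            new HttpSolrServer("http://localhost:8983/solr/collection1");
        List<SolrInputDocument> batch =
            new ArrayList<SolrInputDocument>(BATCH_SIZE);
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            batch.add(doc);
            if (batch.size() == BATCH_SIZE) {
                solr.add(batch);    // one HTTP request per 1000 docs
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);        // flush the remainder
        }
        solr.commit();
        solr.shutdown();
    }
}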




RE: Solr server requirements for 100+ million documents

Posted by Susheel Kumar <su...@thedigitalgroup.net>.
Thanks, Erick for the info.

For indexing, I agree that most of the time is consumed in data acquisition, which in our case is from the database. For indexing we currently use a manual process, i.e. the Solr dashboard Data Import, but we are now looking to automate it. How do you suggest we automate the indexing part? Do you recommend SolrJ, or should we try to automate using Curl?

Re: Solr server requirements for 100+ million documents

Posted by Erick Erickson <er...@gmail.com>.
Can't be done with the information you provided, and can only
be guessed at even with more comprehensive information.

Here's why:

http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Also, at a guess, your indexing speed is so slow due to data
acquisition; I rather doubt you're being limited by raw Solr indexing.
If you're using SolrJ, try commenting out the server.add() bit and
running again. My guess is that your indexing speed will be almost
unchanged, in which case the data acquisition process is where you
should concentrate your efforts. As a comparison, I can index 11M
Wikipedia docs on my laptop in 45 minutes without any attempts at
parallelization.
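
A sketch of that experiment (the connection string, table, and field
names are made up): time the loop once as written, then once with the
add() commented out. If the two timings are close, the DB fetch, not
Solr, is the bottleneck:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AcquisitionTest {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr =
            new HttpSolrServer("http://localhost:8983/solr/collection1");
        Connection db = DriverManager.getConnection(
            "jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "pass");
        Statement stmt = db.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT id, title, body FROM documents");

        long start = System.currentTimeMillis();
        while (rs.next()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", rs.getString("id"));
            doc.addField("title_t", rs.getString("title"));
            doc.addField("body_t", rs.getString("body"));
            solr.add(doc);  // comment this line out and rerun to isolate DB cost
        }
        solr.commit();
        System.out.println("elapsed ms: "
            + (System.currentTimeMillis() - start));
        rs.close();
        stmt.close();
        db.close();
        solr.shutdown();
    }
}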


Best,
Erick
