Posted to solr-user@lucene.apache.org by Daniel Ortega <da...@gmail.com> on 2017/08/22 18:16:43 UTC

Excessive resource consumption migrating from Solr 6.6.0 Master/Slave to SolrCloud 6.6.0 (dozens of times more resources)

*Main Problems*


We are involved in a migration from a Solr Master/Slave infrastructure to a
SolrCloud infrastructure.



The main problems that we have now are:



   - Excessive resource consumption: we currently have 5 instances, each
   with 80 processors and 768 GB of RAM, using SSDs, and they still cannot
   support the load that the other architecture handles. In our
   Master/Slave architecture we have only 7 virtual machines with far lower
   specs (4 processors and 16 GB of RAM each, also on SSDs). So, at the
   moment, our SolrCloud infrastructure is consuming dozens of times more
   resources than our Solr Master/Slave infrastructure.
   - Despite spending more resources, we have worse query times (compared
   to Solr in the master/slave architecture).


*Search infrastructure (SolrCloud infrastructure)*



As we cannot use the DIH handler (which is what we use in Solr Master/Slave),
we have developed an application that reads every transaction from Oracle,
builds a document collection by querying the database, and sends the result
to the */update* handler every 200 milliseconds using the SolrJ client. This
application tries to remove possible duplicates within each update window,
but we also use Solr's de-duplication techniques
<https://cwiki.apache.org/confluence/display/solr/De-Duplication>.



We are indexing ~100 documents per second (with peaks of ~1000 documents
per second).
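The "last version wins" update window described above behaves roughly like the following minimal Python sketch (for illustration only; the real application is Java/SolrJ, where the flush step would hand the batch to the SolrJ client, and all names here are invented):

```python
import threading

class UpdateWindow:
    """Buffers documents between flushes; a later version of the same id
    overwrites the earlier one, so each flush sends only the newest version."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}  # doc id -> latest document version

    def enqueue(self, doc):
        with self._lock:
            self._pending[doc["id"]] = doc  # last writer wins

    def flush(self):
        """Drain the buffer; the result would be posted to /update."""
        with self._lock:
            batch = list(self._pending.values())
            self._pending.clear()
        return batch

# Example: two updates to the same advert arriving within one window
window = UpdateWindow()
window.enqueue({"id": "advert-1", "price": 100})
window.enqueue({"id": "advert-2", "price": 250})
window.enqueue({"id": "advert-1", "price": 120})  # supersedes the first
batch = window.flush()  # contains one version of each advert
```

This is why the batch sizes vary with transaction volume: the buffer collapses repeated updates to the same row before anything is sent.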



Every search query is centralized in another application, which exposes a DSL
behind a REST API and also uses the SolrJ client to perform queries. We have
peaks of 2,000 QPS.
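For reference, each DSL query ultimately reduces to a Solr /select request; sketched below in Python with invented host, collection, and field names (the real services go through CloudSolrClient, not raw HTTP):

```python
from urllib.parse import urlencode

# Hypothetical query the DSL layer might translate to: adverts of one
# type, sorted by price. Every name below is made up for illustration.
params = {
    "q": "advert_type:flat",
    "fl": "id,price,city",
    "sort": "price asc",
    "rows": 10,
    "wt": "json",
}
query_string = urlencode(params)
url = "http://solr-node:8983/solr/adverts/select?" + query_string
```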

*Cluster structure **(SolrCloud infrastructure)*



At the moment, the cluster has 30 SolrCloud instances with the same specs
(same physical hosts, same JVM settings, etc.).



*Main collection*



In our use case we are basically using this collection as a NoSQL database.
Each document is composed of about 300 fields that represent an advert and
is a denormalization of its relational representation in Oracle.


We are using all our nodes to store the collection in 3 shards, so each
shard has 10 replicas.


At the moment, we are only indexing a subset of the adverts stored in
Oracle, but our goal is to store all the ads that we have in the DB (a few
tens of millions of documents). We have NRT requirements, so we need to
index every document as soon as possible once it has changed in Oracle.



We have defined the properties of each field (whether it is stored/indexed,
whether it should be defined as a DocValue, etc.) according to how that
field is used.



*Index size **(SolrCloud infrastructure)*



The index size is currently above 6 GB per shard, storing 1,300,000
documents in each shard. So we are storing 3,900,000 documents in total,
and the total index size is 18 GB.
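As a sanity check on those numbers (and on what the 10-way replication implies for disk, since each replica holds a full copy of its shard):

```python
shards = 3
replicas_per_shard = 10
docs_per_shard = 1_300_000
gb_per_shard = 6

total_docs = shards * docs_per_shard        # logical documents in the collection
logical_index_gb = shards * gb_per_shard    # one copy of every shard
on_disk_gb = logical_index_gb * replicas_per_shard  # every replica stores a full shard copy
```

So the cluster as a whole is carrying roughly 180 GB of index on disk for an 18 GB logical index.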



*Indexation **(SolrCloud infrastructure)*



The commits *aren't* triggered by the application described above. The
hard-commit/soft-commit intervals are configured in Solr:



   - *HardCommit:* every 15 minutes (with openSearcher = false)
   - *SoftCommit:* every 5 seconds
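In solrconfig.xml terms, those intervals correspond to something like the following (a sketch; maxTime is in milliseconds):

```xml
<!-- Hard commit: every 15 minutes, without opening a new searcher -->
<autoCommit>
  <maxTime>900000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- Soft commit: every 5 seconds, making new documents visible -->
<autoSoftCommit>
  <maxTime>5000</maxTime>
</autoSoftCommit>
```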



*Apache Solr Version*



We are currently using the latest version of Solr (6.6.0) on an Oracle VM
(Java(TM) SE Runtime Environment, build 1.8.0_131-b11, Oracle 64-bit) in
both deployments.


The question is... what is wrong here?

Re: Excessive resource consumption migrating from Solr 6.6.0 Master/Slave to SolrCloud 6.6.0 (dozens of times more resources)

Posted by Scott Stults <ss...@opensourceconnections.com>.
Dani,

It might be time to attach some instrumentation to one of your nodes.
Finding out which classes are occupying the memory will help narrow the
issue.

Are you using a lot of facets, grouping, or stats during your queries?
Also, when you were doing Master/Slave, was that on the same version of
Solr as you're using now in SolrCloud mode?


-Scott




-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com

Re: Excessive resource consumption migrating from Solr 6.6.0 Master/Slave to SolrCloud 6.6.0 (dozens of times more resources)

Posted by Daniel Ortega <da...@gmail.com>.
Hi Scott,

Yes, we think that our usage scenario falls into the Index-Heavy/Query-Heavy
category too. We have tested several soft-commit/hard-commit values (from a
few seconds to minutes) with no appreciable improvement :(

Thanks for your reply!

- Daniel


Re: Excessive resource consumption migrating from Solr 6.6.0 Master/Slave to SolrCloud 6.6.0 (dozens of times more resources)

Posted by Scott Stults <ss...@opensourceconnections.com>.
Hi Dani,

It seems like your use case falls into the Index-Heavy / Query-Heavy
category, so you might try increasing your hard commit frequency to 15
seconds rather than 15 minutes:

https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
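In solrconfig.xml terms that suggestion is a one-value change to the hard-commit interval (sketch; values in milliseconds):

```xml
<autoCommit>
  <maxTime>15000</maxTime>  <!-- 15 seconds instead of 900000 (15 minutes) -->
  <openSearcher>false</openSearcher>
</autoCommit>
```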


-Scott

On Thu, Aug 24, 2017 at 10:03 AM, Daniel Ortega <danielortegaufano@gmail.com> wrote:

> Hi Scott,
>
> In our indexing service we are using that client too
> (org.apache.solr.client.solrj.impl.CloudSolrClient) :)
>
> This is our Update Request Processor chain configuration:
>
> <updateProcessor class="solr.processor.SignatureUpdateProcessorFactory"
>                  name="signature">
>   <bool name="enabled">true</bool>
>   <str name="signatureField">hash</str>
>   <bool name="overwriteDupes">false</bool>
>   <str name="signatureClass">solr.processor.Lookup3Signature</str>
> </updateProcessor>
>
> <updateRequestProcessorChain processor="signature" name="dedupe">
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> <!-- de-duplication process explained in:
>      https://cwiki.apache.org/confluence/display/solr/De-Duplication -->
> <requestHandler name="/update" class="solr.UpdateRequestHandler">
>   <lst name="defaults">
>     <str name="update.chain">dedupe</str>
>   </lst>
> </requestHandler>
>
> Thanks for your reply :)
>
> - Dani
>
> 2017-08-24 14:49 GMT+02:00 Scott Stults <sstults@opensourceconnections.com>:
>
> > Hi Daniel,
> >
> > SolrJ has a few client implementations to choose from: CloudSolrClient,
> > ConcurrentUpdateSolrClient, HttpSolrClient, LBHttpSolrClient. You said
> > your query service uses CloudSolrClient, but it would be good to verify
> > which implementation your indexing service uses.
> >
> > One of the problems you might be having is with your deduplication step.
> > Can you post your Update Request Processor chain?
> >
> >
> > -Scott
> >
> > On Wed, Aug 23, 2017 at 4:13 PM, Daniel Ortega <danielortegaufano@gmail.com> wrote:
> >
> > > Hi Scott,
> > >
> > > - *Can you describe the process that queries the DB and sends records to Solr?*
> > >
> > > We enqueue ids during every Oracle transaction (on inserts/updates).
> > >
> > > An application dequeues every id and performs queries against dozens of
> > > tables in the relational model to retrieve the fields needed to build
> > > the document. Since we may modify the same Oracle row in different (but
> > > consecutive) transactions, we store only the last version of each
> > > modified document in a map data structure.
> > >
> > > The application has a configurable interval for sending the documents
> > > stored in the map to the update handler (we have tested different
> > > intervals, from a few milliseconds to several seconds) using the SolrJ
> > > client. Currently we send all the documents every 15 seconds.
> > >
> > > This application is developed using Java, Spring, and Maven, and we run
> > > several instances of it.
> > >
> > > - *Is it a SolrJ-based application?*
> > >
> > > Yes, it is. We aren't using the latest version of the SolrJ client (we
> > > are currently using SolrJ v6.3.0).
> > >
> > > - *If it is, which client package are you using?*
> > >
> > > I don't know exactly what you mean by 'client package' :)
> > >
> > > - *How many documents do you send at once?*
> > >
> > > It depends on the interval described above and the number of
> > > transactions executed in our relational database: from dozens to a few
> > > hundred (and even thousands).
> > >
> > > - *Are you sending your indexing or query traffic through a load balancer?*
> > >
> > > We aren't using a load balancer for indexing, but all our REST query
> > > services go through an HAProxy (using the 'leastconn' algorithm). The
> > > REST query services perform queries using the CloudSolrClient.
> > >
> > > Thanks for your reply,
> > > if you need any further information don't hesitate to ask
> > >
> > > Daniel
> > >
> > > 2017-08-23 14:57 GMT+02:00 Scott Stults <sstults@opensourceconnections.com>:
> > >
> > > > Hi Daniel,
> > > >
> > > > Great background information about your setup! I've got just a few
> more
> > > > questions:
> > > >
> > > > - Can you describe the process that queries the DB and sends records
> to
> > > > Solr?
> > > > - Is it a SolrJ-based application?
> > > > - If it is, which client package are you using?
> > > > - How many documents do you send at once?
> > > > - Are you sending your indexing or query traffic through a load
> > balancer?
> > > >
> > > > If you're sending documents to each replica as fast as they can take
> > > them,
> > > > you might be seeing a bottleneck at the shard leaders. The SolrJ
> > > > CloudSolrClient finds out from Zookeeper which nodes are the shard
> > > leaders
> > > > and sends docs directly to them.
> > > >
> > > >
> > > > -Scott
> > > >
> > > > On Tue, Aug 22, 2017 at 2:16 PM, Daniel Ortega <
> > > > danielortegaufano@gmail.com>
> > > > wrote:
> > > >
> > > > > *Main Problems*
> > > > >
> > > > >
> > > > > We are involved in a migration from Solr Master/Slave
> infrastructure
> > to
> > > > > SolrCloud infrastructure.
> > > > >
> > > > >
> > > > >
> > > > > The main problems that we have now are:
> > > > >
> > > > >
> > > > >
> > > > >    - Excessive resources consumption: Currently we have 5 instances
> > > with
> > > > 80
> > > > >    processors/768 GB RAM each instance using SSD Hard Disk Drives
> > that
> > > > > doesn't
> > > > >    support the load that we have in the other architecture. In our
> > > > >    Master-Slave architecture we have only 7 Virtual Machines with
> > lower
> > > > > specs
> > > > >    (4 processors and 16 GB each instance using SSD Hard Disk Drives
> > > too).
> > > > > So,
> > > > >    at the moment our SolrCloud infrastructure is wasting several
> > dozen
> > > > > times
> > > > >    more resources than our Solr Master/Slave infrastructure.
> > > > >    - Despite spending more resources we have worst query times
> > > (compared
> > > > to
> > > > >    Solr in master/slave architecture)
> > > > >
> > > > >
> > > > > *Search infrastructure (SolrCloud infrastructure)*
> > > > >
> > > > >
> > > > >
> > > > > As we cannot use DIH Handler (which is what we use in Solr
> > > Master/Slave),
> > > > > we
> > > > > have developed an application which reads every transaction from
> > > Oracle,
> > > > > builds a document collection searching in the database and sends
> the
> > > > result
> > > > > to the */update* handler every 200 milliseconds using SolrJ client.
> > > This
> > > > > application tries to delete the possible duplicates in each update
> > > > window,
> > > > > but we are using solr’s de-duplication techniques
> > > > > <https://emea01.safelinks.protection.outlook.com/?url=
> > > > > https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%
> > > > > 2Fsolr%2FDe-Duplication&data=02%7C01%7Cdortega%40idealista.com%
> > > > > 7Cb169ea024abc4954927208d4bc6868eb%7Cd78b7929c2a34897ae9a7d8f8dc1
> > > > > a1cf%7C0%7C0%7C636340604697721266&sdata=WEhzoHC1Bf77K706%
> > > > > 2Fj2wIWOw5gzfOgsP1IPQESvMsqQ%3D&reserved=0>
> > > > >  too.
> > > > >
> > > > >
> > > > >
> > > > > We are indexing ~100 documents per second (with peaks of ~1000
> > > documents
> > > > > per second).
> > > > >
> > > > >
> > > > >
> > > > > Every search query is centralized in other application which
> exposes
> > a
> > > > DSL
> > > > > behind a REST API and uses SolrJ client too to perform queries. We
> > have
> > > > > peaks of 2000 QPS.
> > > > >
> > > > > *Cluster structure **(SolrCloud infrastructure)*
> > > > >
> > > > >
> > > > >
> > > > > At the moment, the cluster has 30 SolrCloud instances with the same
> > > specs
> > > > > (Same physical hosts, same JVM Settings, etc.).
> > > > >
> > > > >
> > > > >
> > > > > *Main collection*
> > > > >
> > > > >
> > > > >
> > > > > In our use case we are using this collection as a NoSQL database
> > > > basically.
> > > > > Our document is composed of about 300 fields that represents an
> > advert,
> > > > and
> > > > > is a denormalization of its relational representation in Oracle.
> > > > >
> > > > >
> > > > > We are using all our nodes to store the  collection in 3 shards.
> So,
> > > each
> > > > > shard has 10 replicas.
> > > > >
> > > > >
> > > > > At the moment, we are only indexing a subset of the adverts stored
> in
> > > > > Oracle, but our goal is to store all the ads that we have in the DB
> > (a
> > > > few
> > > > > tens of millions of documents). We have NRT requirements, so we
> need
> > to
> > > > > index every document as soon as posible once it’s changed in
> Oracle.
> > > > >
> > > > >
> > > > >
> > > > > We have defined the properties of each field (if it’s
> stored/indexed
> > or
> > > > > not, if should be defined as DocValue, etc…) considering the use of
> > > that
> > > > > field.
> > > > >
> > > > >
> > > > >
> > > > > *Index size **(SolrCloud infrastructure)*
> > > > >
> > > > >
> > > > >
> > > > > The index size is currently above 6 GB, storing 1.300.000 documents
> > in
> > > > each
> > > > > shard. So, we are storing 3.900.000 documents and the total index
> > size
> > > is
> > > > > 18 GB.
> > > > >
> > > > >
> > > > >
> > > > > *Indexation **(SolrCloud infrastructure)*
> > > > >
> > > > >
> > > > >
> > > > > The commits *aren’t* triggered by the application described before.
> > The
> > > > > hardcommit/softcommit interval are configured in Solr:
> > > > >
> > > > >
> > > > >
> > > > >    - *HardCommit:* every 15 minutes (with opensearcher = false)
> > > > >    - *SoftCommit:* every 5 seconds
> > > > >
> > > > >
> > > > >
> > > > > *Apache Solr Version*
> > > > >
> > > > >
> > > > >
> > > > > We are currently using the last version of Solr (6.6.0) under an
> > Oracle
> > > > VM
> > > > > (Java(TM) SE Runtime Environment (build 1.8.0_131-b11) Oracle (64
> > > bits))
> > > > in
> > > > > both deployments.
> > > > >
> > > > >
> > > > > The question is... What is wrong here?!?!?!
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Scott Stults | Founder & Solutions Architect | OpenSource
> Connections,
> > > LLC
> > > > | 434.409.2780
> > > > http://www.opensourceconnections.com
> > > >
> > >
> >
> >
> >
> > --
> > Scott Stults | Founder & Solutions Architect | OpenSource Connections,
> LLC
> > | 434.409.2780
> > http://www.opensourceconnections.com
> >
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com

Re: Excessive resources consumption migrating from Solr 6.6.0 Master/Slave to SolrCloud 6.6.0 (dozen times more resources)

Posted by Daniel Ortega <da...@gmail.com>.
Hi Scott,

In our indexing service we are using that client too
(org.apache.solr.client.solrj.impl.CloudSolrClient) :)

This is our Update Request Processor Chain configuration:

<updateProcessor class="solr.processor.SignatureUpdateProcessorFactory" name="signature">
  <bool name="enabled">true</bool>
  <str name="signatureField">hash</str>
  <bool name="overwriteDupes">false</bool>
  <str name="signatureClass">solr.processor.Lookup3Signature</str>
</updateProcessor>

<updateRequestProcessorChain processor="signature" name="dedupe">
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- de-duplication process explained in:
     https://cwiki.apache.org/confluence/display/solr/De-Duplication -->

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
</requestHandler>
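
Note that a chain like the one above assumes the `signatureField` is declared in the schema; this is not shown in the thread, but a minimal declaration would look something like:

```xml
<!-- hypothetical schema entry for the signature field used by the dedupe chain -->
<field name="hash" type="string" indexed="true" stored="true" multiValued="false" />
```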

Thanks for your reply :)

- Dani


Re: Excessive resources consumption migrating from Solr 6.6.0 Master/Slave to SolrCloud 6.6.0 (dozen times more resources)

Posted by Scott Stults <ss...@opensourceconnections.com>.
Hi Daniel,

SolrJ has a few client implementations to choose from: CloudSolrClient,
ConcurrentUpdateSolrClient, HttpSolrClient, LBHttpSolrClient. You said your
query service uses CloudSolrClient, but it would be good to verify which
implementation your indexing service uses.

One of the problems you might be having is with your deduplication step.
Can you post your Update Request Processor Chain?


-Scott





-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com

Re: Excessive resources consumption migrating from Solr 6.6.0 Master/Slave to SolrCloud 6.6.0 (dozen times more resources)

Posted by Daniel Ortega <da...@gmail.com>.
Hi Scott,

- *Can you describe the process that queries the DB and sends records to Solr?*

We are enqueueing ids during every Oracle transaction (on inserts/updates).

An application dequeues every id and performs queries against dozens of
tables in the relational model to retrieve the fields to build the
document. Since we know we may modify the same Oracle row in
different (but consecutive) transactions, we store only the latest version of
each modified document in a map data structure.

The application has a configurable interval for sending the documents stored in
the map to the update handler (we have tested intervals ranging from a few
milliseconds to several seconds) using the SolrJ client. Currently we are
sending all the documents every 15 seconds.

This application is developed using Java, Spring and Maven and we have
several instances.
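
The map-based last-write-wins pattern described above can be sketched with plain JDK types. This is an illustration of the idea, not the actual indexer: the class and method names are made up, and the real code would buffer SolrInputDocuments and send each flushed batch to /update via SolrJ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LastWriteWinsBuffer {
    // id -> latest version of the document seen since the last flush;
    // a later put() for the same id simply overwrites the earlier version
    private final Map<String, String> pending = new ConcurrentHashMap<>();

    /** Called for every dequeued Oracle id. */
    public void put(String id, String doc) {
        pending.put(id, doc);
    }

    /** Called on the configured interval (e.g. every 15 s); drains the map. */
    public List<String> flush() {
        List<String> batch = new ArrayList<>();
        for (String id : pending.keySet()) {
            // remove() ensures a document updated during the flush
            // is kept in the map for the next flush instead of being lost
            String doc = pending.remove(id);
            if (doc != null) {
                batch.add(doc);
            }
        }
        return batch; // the real application would send this batch to /update
    }
}
```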

-* Is it a SolrJ-based application?*

Yes, it is. We aren't using the latest version of the SolrJ client (we are
currently using SolrJ v6.3.0).

- *If it is, which client package are you using?*

I don't know exactly what you mean by 'client package' :)

- *How many documents do you send at once?*

It depends on the interval described above and on the number of
transactions executed in our relational database. From dozens to a few
hundred (and sometimes thousands).

- *Are you sending your indexing or query traffic through a load balancer?*

We aren't using a load balancer for indexing, but all our REST
query services go through an HAProxy (using the 'leastconn' algorithm). The REST
query services perform queries using the CloudSolrClient.

Thanks for your reply,
if you need any further information don't hesitate to ask

Daniel


Re: Excessive resources consumption migrating from Solr 6.6.0 Master/Slave to SolrCloud 6.6.0 (dozen times more resources)

Posted by Scott Stults <ss...@opensourceconnections.com>.
Hi Daniel,

Great background information about your setup! I've got just a few more
questions:

- Can you describe the process that queries the DB and sends records to
Solr?
- Is it a SolrJ-based application?
- If it is, which client package are you using?
- How many documents do you send at once?
- Are you sending your indexing or query traffic through a load balancer?

If you're sending documents to each replica as fast as they can take them,
you might be seeing a bottleneck at the shard leaders. The SolrJ
CloudSolrClient finds out from ZooKeeper which nodes are the shard leaders
and sends docs directly to them.
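
Batch size matters a lot here. As a point of comparison — this is a
minimal, hypothetical sketch, not the poster's actual indexing code — the
buffering an indexing application typically does before handing documents
to the client looks like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Minimal sketch of update batching: accumulate documents and hand them
// off in fixed-size batches instead of one request per document. In a
// real indexer the flusher would call something like
// CloudSolrClient.add(batch); here it is a plain Consumer so the logic
// stands alone.
class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> flusher;
    private final List<T> buffer = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> flusher) {
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    void add(T doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) flush();
    }

    // Also call this at the end of every update window (e.g. every 200 ms)
    // so partial batches are not held back indefinitely.
    void flush() {
        if (buffer.isEmpty()) return;
        flusher.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

Sending batches of a few hundred documents per request is usually far
cheaper than hundreds of single-document requests, since every request
carries fixed per-request overhead on the shard leader.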


-Scott

On Tue, Aug 22, 2017 at 2:16 PM, Daniel Ortega <da...@gmail.com>
wrote:

> *Main Problems*
>
>
> We are involved in a migration from Solr Master/Slave infrastructure to
> SolrCloud infrastructure.
>
>
>
> The main problems that we have now are:
>
>
>
>    - Excessive resource consumption: we currently have 5 instances, each
>    with 80 processors and 768 GB of RAM, backed by SSDs, and they still
>    cannot handle the load that the other architecture handles. In our
>    Master/Slave architecture we have only 7 virtual machines with far
>    lower specs (4 processors and 16 GB of RAM each, also on SSDs). So, at
>    the moment our SolrCloud infrastructure is using several dozen times
>    more resources than our Solr Master/Slave infrastructure.
>    - Despite using more resources, we have worse query times (compared to
>    Solr in the master/slave architecture).
>
>
> *Search infrastructure (SolrCloud infrastructure)*
>
>
>
> As we cannot use the DataImportHandler (which is what we use in Solr
> Master/Slave), we have developed an application which reads every
> transaction from Oracle, builds a document collection by querying the
> database, and sends the result to the */update* handler every 200
> milliseconds using the SolrJ client. This application tries to remove
> possible duplicates in each update window, but we are also using Solr’s
> de-duplication techniques
> <https://cwiki.apache.org/confluence/display/solr/De-Duplication>.
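
For reference, Solr's de-duplication is wired up through a
SignatureUpdateProcessorFactory in solrconfig.xml. A typical chain — the
signature field and source field names below are illustrative, not the
poster's actual schema — looks like:

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signatureField</str>
    <bool name="overwriteDupes">true</bool>
    <!-- Fields used to compute the signature; illustrative names -->
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```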
>
>
>
> We are indexing ~100 documents per second (with peaks of ~1000 documents
> per second).
>
>
>
> Every search query is centralized in other application which exposes a DSL
> behind a REST API and uses SolrJ client too to perform queries. We have
> peaks of 2000 QPS.
>
> *Cluster structure **(SolrCloud infrastructure)*
>
>
>
> At the moment, the cluster has 30 SolrCloud instances with the same specs
> (Same physical hosts, same JVM Settings, etc.).
>
>
>
> *Main collection*
>
>
>
> In our use case we are basically using this collection as a NoSQL
> database. Our document is composed of about 300 fields that represent an
> advert, and is a denormalization of its relational representation in
> Oracle.
>
>
> We are using all our nodes to store the collection in 3 shards. So, each
> shard has 10 replicas.
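
On a 30-node cluster, that layout would come from a Collections API call
along these lines (the collection name is illustrative):

```
http://solr-host:8983/solr/admin/collections?action=CREATE
    &name=ads&numShards=3&replicationFactor=10&maxShardsPerNode=1
```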
>
>
> At the moment, we are only indexing a subset of the adverts stored in
> Oracle, but our goal is to store all the ads that we have in the DB (a few
> tens of millions of documents). We have NRT requirements, so we need to
> index every document as soon as possible once it’s changed in Oracle.
>
>
>
> We have defined the properties of each field (whether it’s stored/indexed
> or not, whether it should be defined as a docValues field, etc.)
> considering the use of that field.
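
In Solr 6.x schema terms, that kind of per-field tuning looks like the
fragment below (field names and types are illustrative, not the poster's
actual 300-field schema):

```xml
<!-- Returned to clients and searchable -->
<field name="title" type="text_general" indexed="true" stored="true"/>
<!-- Used only for sorting/faceting: docValues, neither indexed nor stored -->
<field name="price" type="tlong" indexed="false" stored="false" docValues="true"/>
<!-- Filtered on but never displayed -->
<field name="active" type="boolean" indexed="true" stored="false"/>
```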
>
>
>
> *Index size **(SolrCloud infrastructure)*
>
>
>
> The index size is currently above 6 GB per shard, storing 1,300,000
> documents in each shard. So, we are storing 3,900,000 documents in total
> and the total index size is 18 GB.
>
>
>
> *Indexation **(SolrCloud infrastructure)*
>
>
>
> The commits *aren’t* triggered by the application described before. The
> hard commit/soft commit intervals are configured in Solr:
>
>
>
>    - *HardCommit:* every 15 minutes (with openSearcher = false)
>    - *SoftCommit:* every 5 seconds
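
Those intervals correspond to a solrconfig.xml fragment like this:

```xml
<autoCommit>
  <maxTime>900000</maxTime>      <!-- 15 minutes: flush to disk -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>5000</maxTime>        <!-- 5 seconds: new-searcher visibility -->
</autoSoftCommit>
```

Worth noting: a 5-second soft commit means every one of the 30 replicas
opens a new searcher and invalidates its caches every 5 seconds, which can
by itself account for heavy CPU use under query load.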
>
>
>
> *Apache Solr Version*
>
>
>
> We are currently using the latest version of Solr (6.6.0) on an Oracle JVM
> (Java(TM) SE Runtime Environment, build 1.8.0_131-b11, Oracle, 64-bit) in
> both deployments.
>
>
> The question is... What is wrong here?!?!?!
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com